Chapter 1: Stochastic Processes


What are Stochastic Processes, and how do they fit in? 


[Figure: course map. STATS 210 (Foundations of Statistics and Probability: tools
for understanding randomness, such as random variables and distributions) leads
to STATS 310 (Statistics: randomness in pattern) and to STATS 325 (Probability:
randomness in process).]


Stats 210: laid the foundations of both Statistics and Probability: the tools for 
understanding randomness. 


Stats 310: develops the theory for understanding randomness in pattern: tools 
for estimating parameters (maximum likelihood), testing hypotheses, modelling 
patterns in data (regression models). 


Stats 325: develops the theory for understanding randomness in process. A 
process is a sequence of events where each step follows from the last after a 
random choice. 


What sort of problems will we cover in Stats 325? 


Here are some examples of the sorts of problems that we study in this course. 


Gambler’s Ruin 


You start with $30 and toss a fair coin 
repeatedly. Every time you throw a Head, you 
win $5. Every time you throw a Tail, you lose 
$5. You will stop when you reach $100 or when 
you lose everything. What is the probability that 
you lose everything? Answer: 70%. 


Winning at tennis


What is your probability of winning a game of tennis, starting from the even
score Deuce (40-40), if your probability of winning each point is 0.3 and your
opponent’s is 0.7?


Answer: 15%.


[Figure: state diagram for a tennis game at Deuce, with states DEUCE (D),
VENUS AHEAD (A), VENUS BEHIND (B), VENUS WINS (W) and VENUS LOSES (L), and
transition probabilities p and q.]


Winning a lottery


A million people have bought tickets for the weekly lottery draw. Each person
has a probability of one-in-a-million of selecting the winning numbers. If more
than one person selects the winning numbers, the winner will be chosen at
random from all those with matching numbers.


You watch the lottery draw on TV and your numbers match the winning numbers!
Only a one-in-a-million chance, and there were only a million players, so
surely you will win the prize?


Not quite... What is the probability you will win? Answer: only 63%.
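One way to see where the 63% comes from is sketched below. It assumes (this is not stated explicitly above) that the other 999,999 players choose their numbers independently, so that the number K of other matching tickets is Binomial(n − 1, p). Given that your numbers match, you win with probability E[1/(1 + K)], which has the closed form (1 − (1 − p)^n)/(np). The function name is illustrative only.

```python
import math

def prob_win_given_match(n_players: int, p_match: float) -> float:
    """P(you win the prize | your numbers match).

    Assumes the other n_players - 1 tickets match independently, each with
    probability p_match, and that the winner is drawn uniformly from all
    matching tickets. Uses the closed form of E[1 / (1 + K)] for
    K ~ Binomial(n_players - 1, p_match).
    """
    # 1 - (1 - p)^n, computed stably for tiny p via log1p / expm1.
    p_at_least_one_match = -math.expm1(n_players * math.log1p(-p_match))
    return p_at_least_one_match / (n_players * p_match)

answer = prob_win_given_match(1_000_000, 1e-6)
print(round(answer, 3))  # 0.632
```

With a million players and a one-in-a-million chance each, the result is very close to 1 − 1/e, which is the 63% quoted above.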


Drunkard’s walk 


A very drunk person staggers to left and right as he walks along. With each
step he takes, he staggers one pace to the left with probability 0.5, and one
pace to the right with probability 0.5. What is the expected number of paces
he must take before he ends up one pace to the left of his starting point?


Answer: the expectation is infinite!
Pyramid selling schemes


Have you received a chain letter like this one? Just send $10 to the person
whose name comes at the top of the list, and add your own name to the bottom
of the list. Send the letter to as many people as you can. Within a few months,
the letter promises, you will have received $77,000 in $10 notes! Will you?


[Figure: scanned chain letter. Its calculation runs as follows. If only 3% of
200 people respond to your letter, 6 people will mail 200 letters each
(1,200 letters). If only 3% of 1,200 respond, 36 people will mail 200 letters
each (7,200 letters). If only 3% of 7,200 respond, 216 people will mail 200
letters each (43,200 letters). If only 3% of 43,200 respond, 1,296 people will
mail 200 letters each (259,200 letters). If only 3% of 259,200 respond, 7,776
people will send you $10 each, so you will receive $77,760 in $10 notes.]


Answer: it depends upon the response rate. However, with a fairly realistic
assumption about response rate, we can calculate an expected return of $76
with a 64% chance of getting nothing!


Note: Pyramid selling schemes like this are prohibited under the Fair Trading Act, 
and it is illegal to participate in them. 


The figure shows the spread of the disease SARS (Severe Acute Respiratory
Syndrome) through Singapore in 2003. With this pattern of infections, what is
the probability that the disease eventually dies out of its own accord?


[Figure: spread of SARS infections through Singapore, 2003.]


Answer: 0.997.
Markov’s Marvellous Mystery Tours


Mr Markov’s Marvellous Mystery Tours promises an All-Stochastic Tourist
Experience for the town of Rotorua. Mr Markov has eight tourist attractions, to
which he will take his clients completely at random with the probabilities shown
below. He promises at least three exciting attractions per tour, ending at either
the Lady Knox Geyser or the Tarawera Volcano. (Unfortunately he makes no
mention of how the hapless tourist might get home from these places.)


What is the expected number of activities for a tour starting from the museum? 


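[Figure: transition diagram linking the eight attractions, including 2. Cruise,
4. Flying Fox, 6. Geyser, 7. Helicopter and 8. Volcano, with transition
probabilities such as 1/3 marked on the arrows.]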


Answer: 4.2. 


Structure of the course 


• Probability. Probability and random variables, with special focus on
conditional probability. Finding hitting probabilities for stochastic
processes.


• Expectation. Expectation and variance. Introduction to conditional
expectation, and its application in finding expected reaching times in
stochastic processes.


• Generating functions. Introduction to probability generating
functions, and their applications to stochastic processes, especially the
Random Walk.


• Branching process. This process is a simple model for reproduction.
Examples are the pyramid selling scheme and the spread of SARS above.


• Markov chains. Almost all the examples we look at throughout the
course can be formulated as Markov chains. By developing a single unifying
theory, we can easily tackle complex problems with many states and
transitions like Markov’s Marvellous Mystery Tours above.


The rest of this chapter covers:

• quick revision of sample spaces and random variables;

• formal definition of stochastic processes.


1.1 Revision: Sample spaces and random variables 


Definition: A random experiment is a physical situation whose outcome cannot 
be predicted until it is observed. 


Definition: A sample space, Ω, is a set of possible outcomes of a random
experiment.


Example:
Random experiment: Toss a coin once.
Sample space: Ω = {head, tail}.


Definition: A random variable, X, is defined as a function from the sample space
to the real numbers: X : Ω → ℝ.


That is, a random variable assigns a real number to every possible outcome of a
random experiment.


Example:
Random experiment: Toss a coin once.
Sample space: Ω = {head, tail}.
An example of a random variable: X : Ω → ℝ maps “head” → 1, “tail” → 0.


Essential point: A random variable is a way of producing random real numbers.
1.2 Stochastic Processes


Definition: A stochastic process is a family of random variables,
{X(t) : t ∈ T}, where t usually denotes time. That is, at every time t in the set
T, a random number X(t) is observed.


Definition: {X(t) : t ∈ T} is a discrete-time process if the set T is finite or
countable.
In practice, this generally means T = {0, 1, 2, 3, ...}.
Thus a discrete-time process is {X(0), X(1), X(2), X(3), ...}: a random number
associated with every time 0, 1, 2, 3, ...


Definition: {X(t) : t ∈ T} is a continuous-time process if T is not finite or
countable.
In practice, this generally means T = [0, ∞), or T = [0, K] for some K.


Thus a continuous-time process {X(t) : t ∈ T} has a random number X(t)
associated with every instant in time.


(Note that X(t) need not change at every instant in time, but it is allowed to
change at any time; i.e. not just at t = 0, 1, 2, ..., like a discrete-time process.)


Definition: The state space, S, is the set of real values that X(t) can take.


Every X(t) takes a value in ℝ, but S will often be a smaller set: S ⊆ ℝ. For
example, if X(t) is the outcome of a coin tossed at time t, then the state space
is S = {0, 1}.


Definition: The state space S is discrete if it is finite or countable.
Otherwise it is continuous.


The state space S is the set of states that the stochastic process can be in.
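To make the definitions concrete, here is a minimal sketch of a discrete-time process with a discrete state space: a simple symmetric random walk, in the spirit of the Drunkard’s walk example. The function name, the seed argument and the choice of starting at 0 are assumptions made for illustration only.

```python
import random

def simulate_random_walk(n_steps: int, seed: int = 0) -> list[int]:
    """Simulate X(0), X(1), ..., X(n_steps) for a simple symmetric random walk.

    Time set: T = {0, 1, ..., n_steps} (discrete time).
    State space: S is a subset of the integers (discrete state space).
    """
    rng = random.Random(seed)   # local generator, so runs are reproducible
    path = [0]                  # X(0) = 0 (an illustrative choice)
    for _ in range(n_steps):
        step = 1 if rng.random() < 0.5 else -1   # right or left, each w.p. 0.5
        path.append(path[-1] + step)
    return path

path = simulate_random_walk(10)
print(path)   # one realisation of {X(t) : t = 0, 1, ..., 10}
```

Each call produces one realisation of the family {X(t) : t ∈ T}; a different seed gives a different realisation of the same process.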
For Reference: Discrete Random Variables


1. Binomial distribution
Notation: X ~ Binomial(n, p).


Description: number of successes in n independent trials, each with
probability p of success.


Probability function:

\[ f_X(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, \ldots, n. \]

Mean: E(X) = np.
Variance: Var(X) = np(1 − p) = npq, where q = 1 − p.


Sum: If X ~ Binomial(n, p), Y ~ Binomial(m, p), and X and Y are
independent, then

X + Y ~ Binomial(n + m, p).


2. Poisson distribution
Notation: X ~ Poisson(λ).


Description: arises out of the Poisson process as the number of events in a
fixed time or space, when events occur at a constant average rate. Also
used in many other situations.


Probability function:

\[ f_X(x) = P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda} \quad \text{for } x = 0, 1, 2, \ldots \]

Mean: E(X) = λ.

Variance: Var(X) = λ.


Sum: If X ~ Poisson(λ), Y ~ Poisson(μ), and X and Y are independent,
then
X + Y ~ Poisson(λ + μ).
3. Geometric distribution
Notation: X ~ Geometric(p).


Description: number of failures before the first success in a sequence of
independent trials, each with P(success) = p.


Probability function:

\[ f_X(x) = P(X = x) = (1-p)^x p \quad \text{for } x = 0, 1, 2, \ldots \]

Mean: \( E(X) = \frac{1-p}{p} = \frac{q}{p} \), where q = 1 − p.

Variance: \( \mathrm{Var}(X) = \frac{1-p}{p^2} = \frac{q}{p^2} \), where q = 1 − p.


Sum: if X_1, ..., X_k are independent, and each X_i ~ Geometric(p), then

X_1 + ... + X_k ~ Negative Binomial(k, p).


4. Negative Binomial distribution


Notation: X ~ NegBin(k, p).


Description: number of failures before the kth success in a sequence of
independent trials, each with P(success) = p.


Probability function:

\[ f_X(x) = P(X = x) = \binom{k + x - 1}{x} p^k (1-p)^x \quad \text{for } x = 0, 1, 2, \ldots \]

Mean: \( E(X) = \frac{k(1-p)}{p} = \frac{kq}{p} \), where q = 1 − p.

Variance: \( \mathrm{Var}(X) = \frac{k(1-p)}{p^2} = \frac{kq}{p^2} \), where q = 1 − p.

Sum: If X ~ NegBin(k, p), Y ~ NegBin(m, p), and X and Y are independent,
then

X + Y ~ NegBin(k + m, p).
5. Hypergeometric distribution
Notation: X ~ Hypergeometric(N, M, n).


Description: Sampling without replacement from a finite population. Given
N objects, of which M are ‘special’. Draw n objects without replacement.
X is the number of the n objects that are ‘special’.


Probability function:

\[ f_X(x) = P(X = x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}} \quad \text{for } x = \max(0,\, n + M - N) \text{ to } x = \min(n,\, M). \]

Mean: E(X) = np, where p = M/N.

Variance: \( \mathrm{Var}(X) = np(1-p) \left( \frac{N-n}{N-1} \right) \), where p = M/N.


6. Multinomial distribution


Notation: X = (X_1, ..., X_k) ~ Multinomial(n; p_1, p_2, ..., p_k).


Description: there are n independent trials, each with k possible outcomes.
Let p_i = P(outcome i) for i = 1, ..., k. Then X = (X_1, ..., X_k), where X_i
is the number of trials with outcome i, for i = 1, ..., k.


Probability function:

\[ f_X(x) = P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \]

for x_i ∈ {0, ..., n} ∀ i with \( \sum_{i=1}^{k} x_i = n \), and where p_i ≥ 0 ∀ i, \( \sum_{i=1}^{k} p_i = 1 \).


Marginal distributions: X_i ~ Binomial(n, p_i) for i = 1, ..., k.
Mean: E(X_i) = n p_i for i = 1, ..., k.

Variance: Var(X_i) = n p_i (1 − p_i), for i = 1, ..., k.

Covariance: cov(X_i, X_j) = −n p_i p_j, for all i ≠ j.
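The entries in this reference list can be checked numerically. As a small sketch, the code below takes the Binomial entry and verifies that the probability function sums to 1, that the mean and variance are np and np(1 − p), and that the sum of independent Binomial(n, p) and Binomial(m, p) variables is Binomial(n + m, p). The helper names and the values n = 5, m = 7, p = 0.3 are arbitrary choices for illustration.

```python
from math import comb, isclose

def binomial_pmf(n: int, p: float) -> list[float]:
    """Return [P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

def convolve(f: list[float], g: list[float]) -> list[float]:
    """Probability function of X + Y for independent X ~ f and Y ~ g."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

n, m, p = 5, 7, 0.3   # illustrative values
f = binomial_pmf(n, p)

total = sum(f)                                            # should be 1
mean = sum(x * fx for x, fx in enumerate(f))              # should be n * p
var = sum((x - mean) ** 2 * fx for x, fx in enumerate(f)) # should be n * p * (1 - p)

pmf_of_sum = convolve(f, binomial_pmf(m, p))              # X + Y
pmf_direct = binomial_pmf(n + m, p)                       # Binomial(n + m, p)

print(isclose(total, 1.0), isclose(mean, n * p), isclose(var, n * p * (1 - p)))
print(all(isclose(a, b, abs_tol=1e-12) for a, b in zip(pmf_of_sum, pmf_direct)))
```

The same pattern (build the probability function, then sum) can be used to check the other discrete entries.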
Continuous Random Variables
1. Uniform distribution
Notation: X ~ Uniform(a, b).


Probability density function (pdf): \( f_X(x) = \frac{1}{b-a} \) for a < x < b.


Cumulative distribution function:

\[ F_X(x) = P(X \le x) = \frac{x-a}{b-a} \quad \text{for } a \le x \le b, \]

F_X(x) = 0 for x < a, and F_X(x) = 1 for x > b.


Mean: \( E(X) = \frac{a+b}{2} \).


Variance: \( \mathrm{Var}(X) = \frac{(b-a)^2}{12} \).


2. Exponential distribution


Notation: X ~ Exponential(λ).
Probability density function (pdf): \( f_X(x) = \lambda e^{-\lambda x} \) for 0 < x < ∞.


Cumulative distribution function:
\( F_X(x) = P(X \le x) = 1 - e^{-\lambda x} \) for 0 < x < ∞.
F_X(x) = 0 for x ≤ 0.


Mean: \( E(X) = \frac{1}{\lambda} \).

Variance: \( \mathrm{Var}(X) = \frac{1}{\lambda^2} \).


Sum: if X_1, ..., X_k are independent, and each X_i ~ Exponential(λ), then
X_1 + ... + X_k ~ Gamma(k, λ).
3. Gamma distribution
Notation: X ~ Gamma(k, λ).
Probability density function (pdf):

\[ f_X(x) = \frac{\lambda^k}{\Gamma(k)} x^{k-1} e^{-\lambda x} \quad \text{for } 0 < x < \infty, \]

where \( \Gamma(k) = \int_0^\infty y^{k-1} e^{-y} \, dy \) (the Gamma function).


Cumulative distribution function: no closed form.


Mean: \( E(X) = \frac{k}{\lambda} \).

Variance: \( \mathrm{Var}(X) = \frac{k}{\lambda^2} \).


Sum: if X_1, ..., X_n are independent, and X_i ~ Gamma(k_i, λ), then
X_1 + ... + X_n ~ Gamma(k_1 + ... + k_n, λ).


4. Normal distribution


Notation: X ~ Normal(μ, σ²).
Probability density function (pdf):

\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-(x-\mu)^2 / 2\sigma^2} \quad \text{for } -\infty < x < \infty. \]


Cumulative distribution function: no closed form.


Mean: E(X) = μ.

Variance: Var(X) = σ².

Sum: if X_1, ..., X_n are independent, and X_i ~ Normal(μ_i, σ_i²), then
X_1 + ... + X_n ~ Normal(μ_1 + ... + μ_n, σ_1² + ... + σ_n²).
Probability Density Functions


[Figure: sketches of the density f_X(x) for Uniform(a, b) (constant height
1/(b − a)), Exponential(λ), Gamma(k, λ) and Normal(μ, σ²).]
Chapter 2: Probability


The aim of this chapter is to revise the basic rules of probability. By the end 
of this chapter, you should be comfortable with: 


• conditional probability, and what you can and can’t do with conditional
expressions;


• the Partition Theorem and Bayes’ Theorem;


• First-Step Analysis for finding the probability that a process reaches some
state, by conditioning on the outcome of the first step;


• calculating probabilities for continuous and discrete random variables.


2.1 Sample spaces and events 


Definition: A sample space, Ω, is a set of possible outcomes of a random
experiment.


Definition: An event, A, is a subset of the sample space. 


This means that event A is simply a collection of outcomes. 


Example: 


Random experiment: Pick a person in this class at random. 
Sample space: Ω = {all people in class}
Event A: A = {all males in class}. 


Definition: Event A occurs if the outcome of the random experiment is a member 
of the set A. 


In the example above, event A occurs if the person we pick is male. 
2.2 Probability Reference List


The following properties hold for all events A, B.

• P(∅) = 0.

• 0 ≤ P(A) ≤ 1.

• Complement: \( P(\overline{A}) = 1 - P(A) \).

• Probability of a union: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
For three events A, B, C:

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

If A and B are mutually exclusive, then P(A ∪ B) = P(A) + P(B).

• Conditional probability: \( P(A \mid B) = \frac{P(A \cap B)}{P(B)} \).

• Multiplication rule: P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

• The Partition Theorem: if B_1, B_2, ..., B_m form a partition of Ω, then

\[ P(A) = \sum_{i=1}^{m} P(A \cap B_i) = \sum_{i=1}^{m} P(A \mid B_i) P(B_i) \quad \text{for any event } A. \]

As a special case, B and \( \overline{B} \) partition Ω, so:

\[ P(A) = P(A \cap B) + P(A \cap \overline{B}) = P(A \mid B) P(B) + P(A \mid \overline{B}) P(\overline{B}) \quad \text{for any } A, B. \]

• Bayes’ Theorem: \( P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)} \).

More generally, if B_1, B_2, ..., B_m form a partition of Ω, then

\[ P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{m} P(A \mid B_i) P(B_i)} \quad \text{for any } j. \]

• Chains of events: for any events A_1, A_2, ..., A_n,

\[ P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2 \cap A_1) \cdots P(A_n \mid A_{n-1} \cap \ldots \cap A_1). \]


2.3 Conditional Probability 


Suppose we are working with sample space Ω = {people in class}. I want to find
the proportion of people in the class who ski. What do I do?


Count up the number of people in the class who ski, and divide by the total
number of people in the class.


\[ P(\text{person skis}) = \frac{\text{number of skiers in class}}{\text{total number of people in class}}. \]


Now suppose I want to find the proportion of females in the class who ski.
What do I do?


Count up the number of females in the class who ski, and divide by the total
number of females in the class.


\[ P(\text{female skis}) = \frac{\text{number of female skiers in class}}{\text{total number of females in class}}. \]


By changing from asking about everyone to asking about females only, we have:


• restricted attention to the set of females only,
or: reduced the sample space from the set of everyone to the set of females,
or: conditioned on the event {females}.

We could write the above as:


\[ P(\text{skis} \mid \text{female}) = \frac{\text{number of female skiers in class}}{\text{total number of females in class}}. \]


Conditioning is like changing the sample space: we are now working in
a new sample space of females in class.


In the above example, we could replace ‘skiing’ with any attribute B. We have:


\[ P(\text{skis}) = \frac{\#\ \text{skiers in class}}{\#\ \text{class}}, \qquad P(\text{skis} \mid \text{female}) = \frac{\#\ \text{female skiers in class}}{\#\ \text{females in class}}, \]

so:

\[ P(B) = \frac{\#\ B\text{'s in class}}{\text{total } \#\ \text{people in class}}, \]

and:

\[ P(B \mid \text{female}) = \frac{\#\ \text{female } B\text{'s in class}}{\text{total } \#\ \text{females in class}} = \frac{\#\ \text{in class who are } B \text{ and female}}{\#\ \text{in class who are female}}. \]


Likewise, we could replace ‘female’ with any attribute A:


\[ P(B \mid A) = \frac{\text{number in class who are } B \text{ and } A}{\text{number in class who are } A}. \]


This is how we get the definition of conditional probability:


\[ P(B \mid A) = \frac{P(B \text{ and } A)}{P(A)} = \frac{P(B \cap A)}{P(A)}. \]


By conditioning on event A, we have changed the sample space to the set of A’s
only.


Definition: Let A and B be events on the same sample space: so A ⊆ Ω and B ⊆ Ω.
The conditional probability of event B, given event A, is


\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)}. \]
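The counting argument above can be sketched in a few lines of code. The class roster below is invented purely for illustration, and the helper name prob is an assumption; the point is only that conditioning means counting inside a smaller sample space.

```python
from fractions import Fraction

# Hypothetical class roster, invented for illustration: (sex, skis?) pairs.
roster = [
    ("F", True), ("F", False), ("F", True), ("F", False),
    ("M", True), ("M", False), ("M", False), ("M", False),
    ("F", True), ("M", True),
]

def prob(event, given=lambda person: True) -> Fraction:
    """P(event | given), found by counting within the reduced sample space."""
    reduced_space = [person for person in roster if given(person)]
    favourable = [person for person in reduced_space if event(person)]
    return Fraction(len(favourable), len(reduced_space))

def skis(person) -> bool:
    return person[1]

def female(person) -> bool:
    return person[0] == "F"

p_skis = prob(skis)                                   # out of everyone
p_skis_given_female = prob(skis, given=female)        # out of females only
p_skis_and_female = prob(lambda person: skis(person) and female(person))
p_female = prob(female)

print(p_skis, p_skis_given_female)                    # 1/2 3/5
print(p_skis_given_female == p_skis_and_female / p_female)   # True
```

The last line checks the definition P(B | A) = P(B ∩ A) / P(A) on this roster, with B = {skis} and A = {female}.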
Multiplication Rule: (Immediate from above). For any events A and B,


P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A) = P(B ∩ A).


Conditioning as ‘changing the sample space’ 


The idea that “conditioning” = “changing the sample space” can be very helpful 
in understanding how to manipulate conditional probabilities. 


Any ‘unconditional’ probability can be written as a conditional probability:
P(B) = P(B | Ω).


Writing P(B) = P(B | Ω) just means that we are looking for the probability of
event B, out of all possible outcomes in the set Ω.


In fact, the symbol P belongs to the set Ω: it has no meaning without Ω.
To remind ourselves of this, we can write


P = P_Ω.
Then P(B) = P(B | Ω) = P_Ω(B).


Similarly, P(B|A) means that we are looking for the probability of event B, 
out of all possible outcomes in the set A. 


So A is just another sample space. Thus we can manipulate conditional
probabilities P(· | A) just like any other probabilities, as long as we always
stay inside the same sample space A.


The trick: Because we can think of A as just another sample space, let’s write


P(· | A) = P_A(·)     (Note: NOT standard notation!)


Then we can use P_A just like P, as long as we remember to keep the
A subscript on EVERY P that we write.
This helps us to make quite complex manipulations of conditional probabilities
without thinking too hard or making mistakes. There is only one rule you need
to learn to use this tool effectively:


P_A(B | C) = P(B | C ∩ A) for any A, B, C.


(Proof: Exercise).


The rules: P(· | A) = P_A(·)


P_A(B | C) = P(B | C ∩ A) for any A, B, C.


Examples: 


1. Probability of a union. In general,
P(B ∪ C) = P(B) + P(C) − P(B ∩ C).


Thus, P(B ∪ C | A) = P(B | A) + P(C | A) − P(B ∩ C | A).


2. Which of the following is equal to P(B ∩ C | A)?


(a) P(B | C ∩ A).          (c) P(B | C ∩ A)P(C | A).
(b) P(B | C) / P(A).       (d) P(B | C)P(C | A).


Solution:


P(B ∩ C | A) = P_A(B ∩ C)
             = P_A(B | C) P_A(C)
             = P(B | C ∩ A)P(C | A).


Thus the correct answer is (c).


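Exercise:


1. Which of the following is true?


(a) \( P(\overline{B} \mid A) = 1 - P(B \mid A) \).     (b) \( P(\overline{B} \mid A) = P(B) - P(B \mid A) \).


Solution:
\[ P(\overline{B} \mid A) = P_A(\overline{B}) = 1 - P_A(B) = 1 - P(B \mid A). \]


Thus the correct answer is (a).


2. Which of the following is true?


(a) \( P(\overline{B} \cap A) = P(A) - P(B \cap A) \).     (b) \( P(\overline{B} \cap A) = P(B) - P(B \cap A) \).

Solution:
\[ \begin{aligned} P(\overline{B} \cap A) = P(\overline{B} \mid A) P(A) &= P_A(\overline{B}) P(A) \\ &= (1 - P_A(B)) P(A) \\ &= P(A) - P(B \mid A) P(A) \\ &= P(A) - P(B \cap A). \end{aligned} \]


Thus the correct answer is (a).


3. True or false: \( P(B \mid \overline{A}) = 1 - P(B \mid A) \)?


Answer: False. P(B | A) = P_A(B). Once we have P_A, we are stuck with it!
There is no easy way of converting from P_A to \( P_{\overline{A}} \), or anything else.
Probabilities in one sample space (P_A) cannot tell us anything about
probabilities in a different sample space (\( P_{\overline{A}} \)).


Exercise: if we wish to express \( P(B \mid \overline{A}) \) in terms of only B and A, show that

\[ P(B \mid \overline{A}) = \frac{P(B) - P(B \mid A) P(A)}{1 - P(A)}. \]

Note that this does not simplify nicely!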
2.4 The Partition Theorem (Law of Total Probability)


Definition: Events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.


This means events A and B cannot happen together. If A happens, it excludes
B from happening, and vice-versa.


If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B).
For all other A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).


Definition: Any number of events B_1, B_2, ..., B_k are mutually exclusive if every
pair of the events is mutually exclusive: i.e. B_i ∩ B_j = ∅ for all i, j with i ≠ j.


Definition: A partition of Ω is a collection of mutually exclusive events whose
union is Ω.


That is, sets B_1, B_2, ..., B_k form a partition of Ω if


B_i ∩ B_j = ∅ for all i, j with i ≠ j,


and \( \bigcup_{i=1}^{k} B_i = B_1 \cup B_2 \cup \ldots \cup B_k = \Omega \).


B_1, ..., B_k form a partition of Ω if they have no overlap
and collectively cover all possible outcomes.
Examples:

[Figure: Venn diagrams showing Ω divided into non-overlapping events B_1, B_2, ...,
as examples of partitions.]


Partitioning an event A


Any set A can be partitioned: it doesn’t have to be Ω.
In particular, if B_1, ..., B_k form a partition of Ω, then (A ∩ B_1), ..., (A ∩ B_k)
form a partition of A.


Theorem 2.4: The Partition Theorem (Law of Total Probability)


Let B_1, ..., B_m form a partition of Ω. Then for any event A,

\[ P(A) = \sum_{i=1}^{m} P(A \cap B_i) = \sum_{i=1}^{m} P(A \mid B_i) P(B_i). \]


Both formulations of the Partition Theorem are very widely used, but especially
the conditional formulation \( \sum_{i=1}^{m} P(A \mid B_i) P(B_i) \).
Intuition behind the Partition Theorem:


The Partition Theorem is easy to understand because it simply states that “the
whole is the sum of its parts.”


[Figure: an event A split by a partition B_1, ..., B_4 into the pieces A ∩ B_1,
A ∩ B_2, A ∩ B_3 and A ∩ B_4.]


2.5 Bayes’ Theorem: inverting conditional probabilities 


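Bayes’ Theorem allows us to “invert” a conditional statement, i.e. to express
P(B | A) in terms of P(A | B).


Theorem 2.5: Bayes’ Theorem


For any events A and B:

\[ P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}. \]


Proof:

\[ \begin{aligned} P(B \cap A) &= P(A \cap B) \\ P(B \mid A) P(A) &= P(A \mid B) P(B) \qquad \text{(multiplication rule)} \\ \therefore \quad P(B \mid A) &= \frac{P(A \mid B) P(B)}{P(A)}. \end{aligned} \]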
Extension of Bayes’ Theorem


Suppose that B_1, B_2, ..., B_m form a partition of Ω. By the Partition Theorem,

\[ P(A) = \sum_{i=1}^{m} P(A \mid B_i) P(B_i). \]


Thus, for any single partition member B_j, put B = B_j in Bayes’ Theorem
to obtain:

\[ P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{P(A)} = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{m} P(A \mid B_i) P(B_i)}. \]


Special case: m = 2


Given any event B, the events B and \( \overline{B} \) form a partition of Ω. Thus:

\[ P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A \mid B) P(B) + P(A \mid \overline{B}) P(\overline{B})}. \]


Example: In screening for a certain disease, the probability that a healthy person 
wrongly gets a positive result is 0.05. The probability that a diseased person 
wrongly gets a negative result is 0.002. The overall rate of the disease in the 
population being screened is 1%. If my test gives a positive result, what is the 
probability I actually have the disease? 


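1. Define events:
D = {have disease},     \( \overline{D} \) = {do not have the disease},
P = {positive test},    N = \( \overline{P} \) = {negative test}.


2. Information given (writing \( \mathbb{P} \) for probability, to distinguish it from the event P):


False positive rate is 0.05 ⇒ \( \mathbb{P}(P \mid \overline{D}) = 0.05 \)
False negative rate is 0.002 ⇒ \( \mathbb{P}(N \mid D) = 0.002 \)
Disease rate is 1% ⇒ \( \mathbb{P}(D) = 0.01 \).


3. Looking for \( \mathbb{P}(D \mid P) \):

We have
\[ \mathbb{P}(D \mid P) = \frac{\mathbb{P}(P \mid D) \, \mathbb{P}(D)}{\mathbb{P}(P)}. \]

Now
\[ \mathbb{P}(P \mid D) = 1 - \mathbb{P}(\overline{P} \mid D) = 1 - \mathbb{P}(N \mid D) = 1 - 0.002 = 0.998. \]

Also
\[ \mathbb{P}(P) = \mathbb{P}(P \mid D) \mathbb{P}(D) + \mathbb{P}(P \mid \overline{D}) \mathbb{P}(\overline{D}) = 0.998 \times 0.01 + 0.05 \times (1 - 0.01) = 0.05948. \]

Thus
\[ \mathbb{P}(D \mid P) = \frac{0.998 \times 0.01}{0.05948} = 0.168. \]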


Given a positive test, my chance of having the disease is only 16.8%. 
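The screening calculation is easy to repeat for other rates. The sketch below simply chains the Partition Theorem and Bayes’ Theorem as in the worked example; the function and argument names are illustrative, not standard terminology from these notes.

```python
def posterior_disease_given_positive(
    prevalence: float,
    false_positive_rate: float,
    false_negative_rate: float,
) -> float:
    """P(disease | positive test), via the Partition Theorem and Bayes' Theorem."""
    # P(positive | disease) = 1 - P(negative | disease)
    p_pos_given_disease = 1.0 - false_negative_rate
    # Partition Theorem: condition on disease / no disease.
    p_positive = (
        p_pos_given_disease * prevalence
        + false_positive_rate * (1.0 - prevalence)
    )
    # Bayes' Theorem.
    return p_pos_given_disease * prevalence / p_positive

posterior = posterior_disease_given_positive(
    prevalence=0.01, false_positive_rate=0.05, false_negative_rate=0.002
)
print(round(posterior, 3))  # 0.168
```

Trying a larger prevalence shows how strongly the answer depends on the overall disease rate in the screened population.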


2.6 First-Step Analysis for calculating probabilities in a process 


In a stochastic process, what happens at the next step depends upon the
current state of the process. We often wish to know the probability of eventually
reaching some particular state, given our current position.


Throughout this course, we will tackle this sort of problem using a technique 
called First-Step Analysis. 


The idea is to consider all possible first steps away from the current state. We 
derive a system of equations that specify the probability of the eventual outcome 
given each of the possible first steps. We then try to solve these equations for 
the probability of interest. 


First-Step Analysis depends upon conditional probability and the Partition
Theorem. Let S_1, ..., S_k be the k possible first steps we can take away from our
current state. We wish to find the probability that event E happens eventually.
First-Step Analysis calculates P(E) as follows:


P(E) = P(E | S_1)P(S_1) + ... + P(E | S_k)P(S_k).


Here, P(S_1), ..., P(S_k) give the probabilities of taking the different first steps
1, 2, ..., k.


Example: Tennis game at Deuce.


Venus and Serena are playing tennis, and have reached 
the score Deuce (40-40). (Deuce comes from the French 
word Deux for ‘two’, meaning that each player needs to win two consecutive 
points to win the game.) 


For each point, let: 
p = P(Venus wins point), q = 1 − p = P(Serena wins point).
Assume that all points are independent. 


Let v be the probability that Venus wins the game eventually, starting from 
Deuce. Find v. 
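[Figure: state diagram. From DEUCE (D), move to VENUS AHEAD (A) with probability p
or to VENUS BEHIND (B) with probability q. From A, move to VENUS WINS (W) with
probability p or back to D with probability q. From B, move to VENUS LOSES (L)
with probability q or back to D with probability p.]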

Use First-step analysis. The possible steps starting from Deuce are: 


1. Venus wins the next point (probability p): move to state A; 


2. Venus loses the next point (probability q): move to state B. 


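Let V be the event that Venus wins EVENTUALLY starting from Deuce, so v =
P(V | D). Starting from Deuce (D), the possible steps are to states A and B. So:

\[ \begin{aligned} v = P(\text{Venus wins} \mid D) &= P(V \mid D) \\ &= P_D(V) \\ &= P_D(V \mid A) P_D(A) + P_D(V \mid B) P_D(B) \\ &= P(V \mid A) \, p + P(V \mid B) \, q. \qquad (\star) \end{aligned} \]


Now we need to find P(V | A), and P(V | B), again using First-step analysis:

\[ P(V \mid A) = P(V \mid W) \, p + P(V \mid D) \, q = 1 \times p + v \times q = p + qv. \qquad (a) \]


Similarly,

\[ P(V \mid B) = P(V \mid L) \, q + P(V \mid D) \, p = 0 \times q + v \times p = pv. \qquad (b) \]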
Substituting (a) P(V | A) = p + qv and (b) P(V | B) = pv into (⋆), v = P(V | A)p + P(V | B)q:

\[ \begin{aligned} v &= (p + qv) p + (pv) q \\ v &= p^2 + 2pqv \\ v(1 - 2pq) &= p^2 \\ v &= \frac{p^2}{1 - 2pq}. \end{aligned} \]


Note: Because p + q = 1, we have:

\[ 1 = (p + q)^2 = p^2 + q^2 + 2pq. \]


So the final probability that Venus wins the game is:

\[ v = \frac{p^2}{1 - 2pq} = \frac{p^2}{p^2 + q^2}. \]

Note how this result makes intuitive sense. For the game to finish from Deuce,
either Venus has to win two points in a row (probability p²), or Serena does
(probability q²). The ratio p²/(p² + q²) describes Venus’s ‘share’ of the winning
probability.
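As a quick numerical check, the sketch below compares the closed form p²/(p² + q²) with the result of repeatedly applying the first-step equations until they settle down. The function names are illustrative, and iterating to a fixed point is just one convenient way to solve the equations, not the method used in these notes.

```python
def venus_wins_closed_form(p: float) -> float:
    """P(Venus wins from Deuce) = p^2 / (p^2 + q^2), with q = 1 - p."""
    q = 1.0 - p
    return p * p / (p * p + q * q)

def venus_wins_by_iteration(p: float, tol: float = 1e-12) -> float:
    """Solve the first-step equations by repeated substitution.

    Each pass maps v to p^2 + 2pq*v. Since 2pq <= 1/2, the updates shrink
    geometrically and the loop terminates.
    """
    q = 1.0 - p
    from_deuce = 0.0                                # initial guess for v
    while True:
        from_ahead = p * 1.0 + q * from_deuce       # P(V | A) = p + q v
        from_behind = q * 0.0 + p * from_deuce      # P(V | B) = p v
        updated = p * from_ahead + q * from_behind  # v = P(V|A) p + P(V|B) q
        if abs(updated - from_deuce) < tol:
            return updated
        from_deuce = updated

print(round(venus_wins_closed_form(0.3), 3))   # 0.155
print(abs(venus_wins_closed_form(0.3) - venus_wins_by_iteration(0.3)) < 1e-9)  # True
```

With p = 0.3 this reproduces the tennis answer quoted in Chapter 1 (roughly 15%).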
First-step analysis as the Partition Theorem: 


Our approach to finding v = P(Venus wins) can be summarized as:

\[ P(\text{Venus wins}) = v = \sum_{\text{first steps}} P(V \mid \text{first step}) \, P(\text{first step}). \]


First-step analysis is just the Partition Theorem:
The sample space is Ω = { all possible routes from Deuce to the end }.


An example of a sample point is: D → A → D → B → D → B → L.
Another example is: D → B → D → A → W.


The partition of the sample space that we use in first-step analysis is:
R_1 = { all possible routes from Deuce to the end that start with D → A }
R_2 = { all possible routes from Deuce to the end that start with D → B }

Then first-step analysis simply states:

\[ \begin{aligned} P(V) &= P(V \mid R_1) P(R_1) + P(V \mid R_2) P(R_2) \\ &= P_D(V \mid A) P_D(A) + P_D(V \mid B) P_D(B). \end{aligned} \]


Notation for quick solutions of first-step analysis problems 


Defining a helpful notation is central to modelling with stochastic processes. 
Setting up well-defined notation helps you to solve problems quickly and easily. 
Defining your notation is one of the most important steps in modelling, because 
it provides the conversion from words (which is how your problem starts) to 
mathematics (which is how your problem is solved). 


Several marks are allotted on first-step analysis questions for setting 
up a well-defined and helpful notation. 


[Figure: the state diagram for the tennis game again, with states DEUCE (D),
VENUS AHEAD (A), VENUS BEHIND (B), VENUS WINS (W) and VENUS LOSES (L), and
transition probabilities p and q.]


Here is the correct way to formulate and solve this first-step analysis problem.


Need the probability that Venus wins eventually, starting from Deuce.

1. Define notation: let


v_D = P(Venus wins eventually | start at state D)
v_A = P(Venus wins eventually | start at state A)
v_B = P(Venus wins eventually | start at state B)


2. First-step analysis:

v_D = p v_A + q v_B        (a)

v_A = p × 1 + q v_D        (b)

v_B = p v_D + q × 0        (c)


3. Substitute (b) and (c) in (a):

\[ \begin{aligned} v_D &= p(p + q v_D) + q(p v_D) \\ \Rightarrow \quad v_D (1 - pq - pq) &= p^2 \\ \Rightarrow \quad v_D &= \frac{p^2}{1 - 2pq} \end{aligned} \]

as before.


Special Process: the Gambler’s Ruin 


This is a famous problem in probability. A gambler 
starts with $x. She tosses a fair coin repeatedly. 


If she gets a Head, she wins $1. If she gets a Tail, 
she loses $1. 


The coin tossing is repeated until the gambler has either $0 or $N, when she 
stops. What is the probability of the Gambler’s Ruin, i.e. that the gambler 
ends up with $0? 


[Figure: transition diagram for the gambler’s fortune. From each interior state
the fortune moves up $1 or down $1, each with probability 1/2.]


Wish to find 
P(ends with $0| starts with $x) . 


Define event 
R = {eventual Ruin} = {ends with $0} . 


We wish to find P(R| starts with $x). 


Define notation: 


p_x = P(R | currently has $x) for x = 0, 1, ..., N.
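Information given:

p_0 = P(R | currently has $0) = 1,
p_N = P(R | currently has $N) = 0.


First-step analysis:

\[ \begin{aligned} p_x &= P(R \mid \text{has } \$x) \\ &= \tfrac{1}{2} P(R \mid \text{has } \$(x+1)) + \tfrac{1}{2} P(R \mid \text{has } \$(x-1)) \\ &= \tfrac{1}{2} p_{x+1} + \tfrac{1}{2} p_{x-1}. \qquad (\star) \end{aligned} \]

True for x = 1, 2, ..., N − 1, with boundary conditions p_0 = 1, p_N = 0.


Solution of difference equation (⋆):

\[ \begin{aligned} p_x &= \tfrac{1}{2} p_{x+1} + \tfrac{1}{2} p_{x-1} \quad \text{for } x = 1, 2, \ldots, N-1; \\ p_0 &= 1; \qquad (\star) \\ p_N &= 0. \end{aligned} \]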


We usually solve equations like this using the theory of 2nd-order difference 
equations. For this special case we will also verify the answer by two other 
methods. 


1. Theory of linear 2nd order difference equations 


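Theory tells us that the general solution of the difference equation (⋆) is
p_x = A + Bx for some constants A, B and for x = 0, 1, ..., N. Our job is to
find A and B using the boundary conditions:


p_x = A + Bx for constants A and B and for x = 0, 1, ..., N.
So

\[ \begin{aligned} p_0 &= A + B \times 0 = 1 \quad \Rightarrow \quad A = 1; \\ p_N &= A + B \times N = 1 + BN = 0 \quad \Rightarrow \quad B = -\frac{1}{N}. \end{aligned} \]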
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So our solution is: pr = A+ Ba=1-— for = 0,1, a5 Ns 


For Stats 325, you will be told the general solution of the 2nd-order difference 
equation and expected to solve it using the boundary conditions. 


For Stats 721, we will study the theory of 2nd-order difference equations. You 
will be able to derive the general solution for yourself before solving it. 


Question: What is the probability that the gambler wins (ends with $N), 
starting with $x? 


\[ P(\text{ends with } \$N) = 1 - P(\text{ends with } \$0) = 1 - p_x = \frac{x}{N} \quad \text{for } x = 0, 1, \ldots, N. \]
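The formula for the ruin probability can be checked numerically by solving the first-step equations directly. The sketch below is one way to do this in R; the function name `ruin_probs` and the choice N = 20 are illustrative assumptions. With N = 20 and x = 6 it corresponds to the Chapter 1 example (start with $30, bet $5 at a time, stop at $100) measured in units of $5.

```r
# Solve p_x = 0.5*p_{x+1} + 0.5*p_{x-1} for x = 1, ..., N-1,
# with boundary conditions p_0 = 1 and p_N = 0.
# ruin_probs is a hypothetical helper name, not from the notes.
ruin_probs <- function(N) {
  M <- diag(N + 1)               # rows for x = 0 and x = N stay as identity rows
  rhs <- numeric(N + 1)
  rhs[1] <- 1                    # p_0 = 1; the entry for p_N is already 0
  for (x in seq_len(N - 1)) {    # interior states; state x is stored at index x + 1
    M[x + 1, x]     <- -0.5      # coefficient of p_{x-1}
    M[x + 1, x + 2] <- -0.5      # coefficient of p_{x+1}
  }
  solve(M, rhs)                  # vector (p_0, p_1, ..., p_N)
}

N <- 20
p <- ruin_probs(N)

# Compare with the closed form 1 - x/N
formula <- 1 - (0:N) / N
max_diff <- max(abs(p - formula))

# Ruin probability starting from x = 6 (i.e. $30 in units of $5)
ruin_from_6 <- p[6 + 1]
print(max_diff)
print(ruin_from_6)
```

The numerical solution should match 1 - x/N up to rounding error, and the value at x = 6 agrees with the 70% answer quoted in Chapter 1.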


2. Solution by inspection 


The problem shown in this section is the symmetric Gambler’s Ruin, where 
the probability is 1/2 of moving up or down on any step. For this special case,
we can solve the difference equation by inspection. 


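We have:

\[ p_x = \tfrac{1}{2} p_{x+1} + \tfrac{1}{2} p_{x-1} \]
\[ \tfrac{1}{2} p_x + \tfrac{1}{2} p_x = \tfrac{1}{2} p_{x+1} + \tfrac{1}{2} p_{x-1} \]

Rearranging: \( p_{x-1} - p_x = p_x - p_{x+1} \). Boundaries: \( p_0 = 1 \), \( p_N = 0 \).

There are N steps to go down from \( p_0 = 1 \) to \( p_N = 0 \). Each step is the same size, because \( (p_{x-1} - p_x) = (p_x - p_{x+1}) \) for all x. So each step has size 1/N,

\[ \Rightarrow \quad p_1 = 1 - 1/N, \quad p_2 = 1 - 2/N, \quad \text{etc.} \]

[Figure: plot of \( p_x \) against x = 0, 1, 2, ..., N, falling in equal steps from \( p_0 = 1 \) to \( p_N = 0 \).]

So

\[ p_x = 1 - \frac{x}{N} \quad \text{as before.} \]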


3. Solution by repeated substitution. 


In principle, all systems could be solved by this method, but it is usually too 
tedious to apply in practice. 


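Rearrange (⋆) to give:

\[ p_{x+1} = 2p_x - p_{x-1} \]
\[ \Rightarrow \quad (x = 1) \quad p_2 = 2p_1 - 1 \quad (\text{recall } p_0 = 1) \]
\[ (x = 2) \quad p_3 = 2p_2 - p_1 = 2(2p_1 - 1) - p_1 = 3p_1 - 2 \]
\[ (x = 3) \quad p_4 = 2p_3 - p_2 = 2(3p_1 - 2) - (2p_1 - 1) = 4p_1 - 3 \quad \text{etc.} \]
\[ \text{giving} \quad p_x = x p_1 - (x - 1) \quad \text{in general,} \qquad (\star\star) \]
\[ \text{likewise} \quad p_N = N p_1 - (N - 1) \quad \text{at endpoint.} \]

Boundary condition:

\[ p_N = 0 \quad \Rightarrow \quad N p_1 - (N - 1) = 0 \quad \Rightarrow \quad p_1 = 1 - 1/N. \]

Substitute in (⋆⋆):

\[ p_x = x p_1 - (x - 1) \]
\[ \quad = x\left(1 - \tfrac{1}{N}\right) - (x - 1) \]
\[ \quad = x - \tfrac{x}{N} - x + 1 \]
\[ \quad = 1 - \tfrac{x}{N} \quad \text{as before.} \quad \Box \]


2.8 Independence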


Definition: Events A and B are statistically independent if and only if 


\[ P(A \cap B) = P(A)P(B). \]


This implies that, provided P(B) > 0, A and B are statistically independent if and only if


\[ P(A \mid B) = P(A). \]


Note: If events are physically independent, they will also be statistically indept. 
For interest: more than two events

Definition: For more than two events, \( A_1, A_2, \ldots, A_n \), we say that \( A_1, A_2, \ldots, A_n \) are mutually independent if

\[ P\Big( \bigcap_{i \in J} A_i \Big) = \prod_{i \in J} P(A_i) \quad \text{for ALL finite subsets } J \subseteq \{1, 2, \ldots, n\}. \]

Example: events \( A_1, A_2, A_3, A_4 \) are mutually independent if
i) \( P(A_i \cap A_j) = P(A_i)P(A_j) \) for all i, j with \( i \ne j \); AND
ii) \( P(A_i \cap A_j \cap A_k) = P(A_i)P(A_j)P(A_k) \) for all i, j, k that are all different; AND
iii) \( P(A_1 \cap A_2 \cap A_3 \cap A_4) = P(A_1)P(A_2)P(A_3)P(A_4) \).

Note: For mutual independence, it is not enough to check that \( P(A_i \cap A_j) = P(A_i)P(A_j) \) for all \( i \ne j \). Pairwise independence does not imply mutual independence.
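A small enumeration makes the last point concrete. The sketch below uses a standard textbook example that is not taken from these notes: toss a fair coin twice and let A = {first toss is a Head}, B = {second toss is a Head}, C = {both tosses show the same face}. All names in the code are illustrative.

```r
# Sample space: two fair coin tosses, four equally likely outcomes
outcomes <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"),
                        stringsAsFactors = FALSE)
prob <- rep(1 / 4, nrow(outcomes))

# Events as logical vectors over the sample space (assumed example events)
A <- outcomes$toss1 == "H"
B <- outcomes$toss2 == "H"
C <- outcomes$toss1 == outcomes$toss2

# Probability of an event = sum of the probabilities of its outcomes
P <- function(event) sum(prob[event])

# Every pair satisfies the product rule
pairwise_ok <- all(
  isTRUE(all.equal(P(A & B), P(A) * P(B))),
  isTRUE(all.equal(P(A & C), P(A) * P(C))),
  isTRUE(all.equal(P(B & C), P(B) * P(C)))
)

# The triple does not: P(A and B and C) = 1/4 but P(A)P(B)P(C) = 1/8
triple_lhs <- P(A & B & C)
triple_rhs <- P(A) * P(B) * P(C)
print(c(pairwise_ok = pairwise_ok, lhs = triple_lhs, rhs = triple_rhs))
```

Here the three events are pairwise independent, yet knowing any two of them determines the third, so they are not mutually independent.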


2.9 The Continuity Theorem 


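The Continuity Theorem states that probability is a continuous set function:

Theorem 2.9: The Continuity Theorem

a) Let \( A_1, A_2, \ldots \) be an increasing sequence of events: i.e.

\[ A_1 \subseteq A_2 \subseteq \ldots \subseteq A_n \subseteq A_{n+1} \subseteq \ldots \]

Then

\[ P\Big( \lim_{n \to \infty} A_n \Big) = \lim_{n \to \infty} P(A_n). \]

Note: because \( A_1 \subseteq A_2 \subseteq \ldots \), we have: \( \lim_{n \to \infty} A_n = \bigcup_{n=1}^{\infty} A_n \).

b) Let \( B_1, B_2, \ldots \) be a decreasing sequence of events: i.e.

\[ B_1 \supseteq B_2 \supseteq \ldots \supseteq B_n \supseteq B_{n+1} \supseteq \ldots \]

Then

\[ P\Big( \lim_{n \to \infty} B_n \Big) = \lim_{n \to \infty} P(B_n). \]

Note: because \( B_1 \supseteq B_2 \supseteq \ldots \), we have: \( \lim_{n \to \infty} B_n = \bigcap_{n=1}^{\infty} B_n \).

Proof (a) only: for (b), take complements and use (a).

Define \( C_1 = A_1 \), and \( C_i = A_i \setminus A_{i-1} \) for i = 2, 3, .... Then \( C_1, C_2, \ldots \) are mutually exclusive, and \( \bigcup_{i=1}^{n} C_i = \bigcup_{i=1}^{n} A_i \), and likewise, \( \bigcup_{i=1}^{\infty} C_i = \bigcup_{i=1}^{\infty} A_i \).

Thus

\[ P\Big( \lim_{n \to \infty} A_n \Big) = P\Big( \bigcup_{i=1}^{\infty} A_i \Big) = P\Big( \bigcup_{i=1}^{\infty} C_i \Big) = \sum_{i=1}^{\infty} P(C_i) \quad (C_i \text{ mutually exclusive}) \]
\[ \quad = \lim_{n \to \infty} \sum_{i=1}^{n} P(C_i) \]
\[ \quad = \lim_{n \to \infty} P\Big( \bigcup_{i=1}^{n} C_i \Big) \]
\[ \quad = \lim_{n \to \infty} P\Big( \bigcup_{i=1}^{n} A_i \Big) \]
\[ \quad = \lim_{n \to \infty} P(A_n) \quad \Big(\text{because } \textstyle\bigcup_{i=1}^{n} A_i = A_n \text{ for an increasing sequence}\Big). \quad \Box \]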
2.10 Random Variables


Definition: A random variable, X, is defined as a function from the sample space
to the real numbers: \( X : \Omega \to \mathbb{R} \).


A random variable therefore assigns a real number to every possible outcome of 
a random experiment. 


A random variable is essentially a rule or mechanism for generating random real 
numbers. 


The Distribution Function 


Definition: The cumulative distribution function of a random variable X is given by

\[ F_X(x) = P(X \le x). \]

\( F_X(x) \) is often referred to as simply the distribution function.


Properties of the distribution function

1) \( \lim_{x \to -\infty} F_X(x) = 0 \) and \( \lim_{x \to +\infty} F_X(x) = 1 \).

2) \( F_X(x) \) is a non-decreasing function of x:
if \( x_1 < x_2 \), then \( F_X(x_1) \le F_X(x_2) \).

3) If b > a, then \( P(a < X \le b) = F_X(b) - F_X(a) \).

4) \( F_X \) is right-continuous: i.e. \( \lim_{h \downarrow 0} F_X(x + h) = F_X(x) \).
2.11 Continuous Random Variables

Definition: The random variable X is continuous if the distribution function \( F_X(x) \) is a continuous function.

In practice, this means that a continuous random variable takes values in a continuous subset of \( \mathbb{R} \): e.g. \( X : \Omega \to [0, 1] \) or \( X : \Omega \to [0, \infty) \).

[Figure: a continuous distribution function \( F_X(x) \) plotted against x.]


Probability Density Function for continuous random variables 


Definition: Let X be a continuous random variable with continuous distribution function \( F_X(x) \). The probability density function (p.d.f.) of X is defined as

\[ f_X(x) = F_X'(x) = \frac{d}{dx}\big( F_X(x) \big). \]

The pdf, \( f_X(x) \), gives the shape of the distribution of X.

[Figure: example p.d.f. shapes for the Normal distribution, the Exponential distribution, and the Gamma distribution.]

By the Fundamental Theorem of Calculus, the distribution function \( F_X(x) \) can be written in terms of the probability density function, \( f_X(x) \), as follows:

\[ F_X(x) = \int_{-\infty}^{x} f_X(u)\, du. \]


Endpoints of intervals

For continuous random variables, every point x has P(X = x) = 0. This means that the endpoints of intervals are not important for continuous random variables.

Thus, \( P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b) \).

This is only true for continuous random variables.

Calculating probabilities for continuous random variables

To calculate \( P(a \le X \le b) \), use either

\[ P(a \le X \le b) = F_X(b) - F_X(a) \]

or

\[ P(a \le X \le b) = \int_a^b f_X(x)\, dx. \]


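Example: Let X be a continuous random variable with p.d.f.

\[ f_X(x) = \begin{cases} 2x^{-2} & \text{for } 1 < x < 2, \\ 0 & \text{otherwise.} \end{cases} \]

(a) Find the cumulative distribution function, \( F_X(x) \).
(b) Find \( P(X \le 1.5) \).

a) For 1 < x < 2,

\[ F_X(x) = \int_{-\infty}^{x} f_X(u)\, du = \int_1^x 2u^{-2}\, du = \left[ -\frac{2}{u} \right]_1^x = 2 - \frac{2}{x}. \]

Thus

\[ F_X(x) = \begin{cases} 0 & \text{for } x \le 1, \\ 2 - \frac{2}{x} & \text{for } 1 < x < 2, \\ 1 & \text{for } x \ge 2. \end{cases} \]

b) \( P(X \le 1.5) = F_X(1.5) = 2 - \frac{2}{1.5} = \frac{2}{3} \).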
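If R is available, the working for this example can be checked by numerical integration. This is only a sketch: the function names `f` and `F_formula` are illustrative, and `integrate` is R's built-in one-dimensional quadrature routine.

```r
# p.d.f. from the example: f(x) = 2 / x^2 on (1, 2), zero elsewhere
f <- function(x) ifelse(x > 1 & x < 2, 2 / x^2, 0)

# Distribution function derived in part (a)
F_formula <- function(x) ifelse(x <= 1, 0, ifelse(x < 2, 2 - 2 / x, 1))

# A valid p.d.f. must integrate to 1 over its support
total <- integrate(f, lower = 1, upper = 2)$value

# Part (b): P(X <= 1.5) by integrating the p.d.f., and by the formula for F
p_num <- integrate(f, lower = 1, upper = 1.5)$value
p_formula <- F_formula(1.5)

print(c(total = total, p_num = p_num, p_formula = p_formula))
```

Both routes to P(X ≤ 1.5) should give approximately 2/3.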


2.12 Discrete Random Variables 


Definition: The random variable X is discrete if X takes values in a finite or countable subset of \( \mathbb{R} \): thus, \( X : \Omega \to \{x_1, x_2, \ldots\} \).

When X is a discrete random variable, the distribution function \( F_X(x) \) is a step function.

[Figure: a step-function distribution function \( F_X(x) \) plotted against x.]

Probability function

Definition: Let X be a discrete random variable with distribution function \( F_X(x) \). The probability function of X is defined as

\[ f_X(x) = P(X = x). \]
Endpoints of intervals

For discrete random variables, individual points can have P(X = x) > 0.

This means that the endpoints of intervals ARE important for discrete random variables.

For example, if X takes values 0, 1, 2, ..., and a, b are integers with b > a, then

\[ P(a \le X \le b) = P(a - 1 < X \le b) = P(a \le X < b + 1) = P(a - 1 < X < b + 1). \]
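The effect of the endpoints is easy to see numerically. The sketch below assumes, purely for illustration, that X ~ Binomial(10, 0.3) with a = 2 and b = 5; none of these values come from the notes. It uses R's `dbinom` (probability function) and `pbinom` (distribution function).

```r
# Illustrative assumption: X ~ Binomial(n = 10, prob = 0.3), a = 2, b = 5
n <- 10
prob <- 0.3
a <- 2
b <- 5

# P(a <= X <= b) by summing the probability function over a, a+1, ..., b
direct <- sum(dbinom(a:b, size = n, prob = prob))

# Same probability from the distribution function:
# P(a <= X <= b) = P(a - 1 < X <= b) = F(b) - F(a - 1)
via_cdf <- pbinom(b, size = n, prob = prob) - pbinom(a - 1, size = n, prob = prob)

# A common slip: F(b) - F(a) gives P(a < X <= b), which omits P(X = a)
slip <- pbinom(b, size = n, prob = prob) - pbinom(a, size = n, prob = prob)

print(c(direct = direct, via_cdf = via_cdf, slip = slip))
```

The first two values agree, while the third is smaller by exactly P(X = a).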


Calculating probabilities for discrete random variables 
To calculate \( P(X \in A) \) for any countable set A, use

\[ P(X \in A) = \sum_{x \in A} P(X = x). \]


Partition Theorem for probabilities of discrete random variables 


Recall the Partition Theorem: for any event A, and for events \( B_1, B_2, \ldots \) that form a partition of \( \Omega \),

\[ P(A) = \sum_y P(A \mid B_y) P(B_y). \]

We can use the Partition Theorem to find probabilities for random variables. Let X and Y be discrete random variables.

• Define event A as A = {X = x}.

• Define event \( B_y \) as \( B_y = \{Y = y\} \) for y = 0, 1, 2, ... (or whatever values Y takes).

• Then, by the Partition Theorem,

\[ P(X = x) = \sum_y P(X = x \mid Y = y) P(Y = y). \]


2.13 Independent Random Variables 


Random variables X and Y are independent if they have no effect on each 
other. This means that the probability that they both take specified values 
simultaneously is the product of the individual probabilities. 


Definition: Let X and Y be random variables. The joint distribution function of X and Y is given by

\[ F_{X,Y}(x, y) = P(X \le x \text{ and } Y \le y) = P(X \le x, Y \le y). \]

Definition: Let X and Y be any random variables (continuous or discrete). X and Y are independent if

\[ F_{X,Y}(x, y) = F_X(x) F_Y(y) \quad \text{for ALL } x, y \in \mathbb{R}. \]

If X and Y are discrete, they are independent if and only if their joint probability function is the product of their individual probability functions:

\[ \text{Discrete } X, Y \text{ are indept} \iff P(X = x \text{ AND } Y = y) = P(X = x) P(Y = y) \quad \text{for ALL } x, y \]
\[ \iff f_{X,Y}(x, y) = f_X(x) f_Y(y) \quad \text{for ALL } x, y. \]
Chapter 3: Expectation and Variance


In the previous chapter we looked at probability, with three major themes: 
1. Conditional probability: P(A | B). 


2. First-step analysis for calculating eventual probabilities in a stochastic 
process. 


3. Calculating probabilities for continuous and discrete random variables. 


In this chapter, we look at the same themes for expectation and variance. 
The expectation of a random variable is the long-term average of the random 
variable. 


Imagine observing many thousands of independent random values from the 
random variable of interest. Take the average of these random values. The 
expectation is the value of this average as the sample size tends to infinity. 


We will repeat the three themes of the previous chapter, but in a different order. 


1. Calculating expectations for continuous and discrete random variables. 


2. Conditional expectation: the expectation of a random variable X, conditional
on the value taken by another random variable Y. If the value of
Y affects the value of X (i.e. X and Y are dependent), the conditional
expectation of X given the value of Y will generally be different from the
overall expectation of X.


3. First-step analysis for calculating the expected amount of time needed to 
reach a particular state in a process (e.g. the expected number of shots 
before we win a game of tennis). 


We will also study similar themes for variance. 


3.1 Expectation 


The mean, expected value, or expectation of a random variable X is written as E(X) or \( \mu_X \). If we observe N random values of X, then the mean of the N values will be approximately equal to E(X) for large N. The expectation is defined differently for continuous and discrete random variables.

Definition: Let X be a continuous random variable with p.d.f. \( f_X(x) \). The expected value of X is

\[ E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx. \]

Definition: Let X be a discrete random variable with probability function \( f_X(x) \). The expected value of X is

\[ E(X) = \sum_x x f_X(x) = \sum_x x P(X = x). \]


Expectation of g(X) 


Let g(X) be a function of X. We can imagine a long-term average of g(X) just as we can imagine a long-term average of X. This average is written as E(g(X)). Imagine observing X many times (N times) to give results \( x_1, x_2, \ldots, x_N \). Apply the function g to each of these observations, to give \( g(x_1), \ldots, g(x_N) \). The mean of \( g(x_1), g(x_2), \ldots, g(x_N) \) approaches E(g(X)) as the number of observations N tends to infinity.

Definition: Let X be a continuous random variable, and let g be a function. The expected value of g(X) is

\[ E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx. \]

Definition: Let X be a discrete random variable, and let g be a function. The expected value of g(X) is

\[ E(g(X)) = \sum_x g(x) f_X(x) = \sum_x g(x) P(X = x). \]
Expectation of XY: the definition of E(XY)


Suppose we have two random variables, X and Y. These might be independent, 
in which case the value of X has no effect on the value of Y. Alternatively, 
X and Y might be dependent: when we observe a random value for X, it 
might influence the random values of Y that we are most likely to observe. For 
example, X might be the height of a randomly selected person, and Y might 
be the weight. On the whole, larger values of X will be associated with larger 
values of Y. 


To understand what E(XY) means, think of observing a large number of pairs \( (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \). If X and Y are dependent, the value \( x_i \) might affect the value \( y_i \), and vice versa, so we have to keep the observations together in their pairings. As the number of pairs N tends to infinity, the average \( \frac{1}{N} \sum_{i=1}^{N} x_i \times y_i \) approaches the expectation E(XY).


For example, if X is height and Y is weight, E(XY) is the average of (height 
x weight). We are interested in E(XY) because it is used for calculating the 
covariance and correlation, which are measures of how closely related X and Y 
are (see Section 3.2). 


Properties of Expectation 


i) Let g and h be functions, and let a and b be constants. For any random variable X (discrete or continuous),

\[ E\{ a\,g(X) + b\,h(X) \} = a\,E\{ g(X) \} + b\,E\{ h(X) \}. \]

In particular,

\[ E(aX + b) = a\,E(X) + b. \]

ii) Let X and Y be ANY random variables (discrete, continuous, independent, or non-independent). Then

\[ E(X + Y) = E(X) + E(Y). \]

More generally, for ANY random variables \( X_1, \ldots, X_n \),

\[ E(X_1 + \ldots + X_n) = E(X_1) + \ldots + E(X_n). \]

iii) Let X and Y be independent random variables, and g, h be functions. Then

\[ E(XY) = E(X)E(Y), \]
\[ E\big( g(X) h(Y) \big) = E\big( g(X) \big) E\big( h(Y) \big). \]


Notes: 1. E(XY) =E(X)E(Y) is ONLY generally true if X and Y are 
INDEPENDENT. 


2. If X and Y are independent, then E(XY) = E(X)E(Y). However, the 
converse is not generally true: it is possible for E(XY) = E(X)E(Y) even 
though X and Y are dependent. 
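A short calculation illustrates Note 2. The example is a standard one and is not taken from these notes: let X take the values -1, 0, 1 with probability 1/3 each, and let Y = X². Then Y is completely determined by X, yet E(XY) = E(X)E(Y). The variable names in this R sketch are illustrative.

```r
# Assumed example: X uniform on {-1, 0, 1}, and Y = X^2 (a function of X)
x <- c(-1, 0, 1)
px <- rep(1 / 3, 3)
y <- x^2

# Expectations computed directly from the probability function of X
EX  <- sum(x * px)         # E(X)
EY  <- sum(y * px)         # E(Y) = E(X^2)
EXY <- sum(x * y * px)     # E(XY) = E(X^3)

# Dependence check: compare P(X = 0, Y = 0) with P(X = 0) * P(Y = 0)
p_joint <- sum(px[x == 0 & y == 0])
p_prod  <- sum(px[x == 0]) * sum(px[y == 0])

print(c(EXY = EXY, EX_times_EY = EX * EY, p_joint = p_joint, p_prod = p_prod))
```

E(XY) equals E(X)E(Y) here, but the joint probability does not factorise, so X and Y are dependent.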


Probability as an Expectation 


Let A be any event. We can write P(A) as an expectation, as follows. 
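Define the indicator function:

\[ I_A = \begin{cases} 1 & \text{if event } A \text{ occurs,} \\ 0 & \text{otherwise.} \end{cases} \]

Then \( I_A \) is a random variable, and

\[ E(I_A) = \sum_{r=0}^{1} r\, P(I_A = r) \]
\[ \quad = 0 \times P(I_A = 0) + 1 \times P(I_A = 1) \]
\[ \quad = P(I_A = 1) \]
\[ \quad = P(A). \]

Thus \( P(A) = E(I_A) \) for any event A.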
3.2 Variance, covariance, and correlation


The variance of a random variable X is a measure of how spread out it is. 
Are the values of X clustered tightly around their mean, or can we commonly 
observe values of X a long way from the mean value? The variance measures 
how far the values of X are from their mean, on average. 


Definition: Let X be any random variable. The variance of X is 


\[ Var(X) = E\big( (X - \mu_X)^2 \big) = E(X^2) - (E(X))^2. \]


The variance is the mean squared deviation of a random variable from its own 
mean. 


If X has high variance, we can observe values of X a long way from the mean. 


If X has low variance, the values of X tend to be clustered tightly around the 
mean value. 


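Example: Let X be a continuous random variable with p.d.f.

\[ f_X(x) = \begin{cases} 2x^{-2} & \text{for } 1 < x < 2, \\ 0 & \text{otherwise.} \end{cases} \]

Find E(X) and Var(X).

\[ E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_1^2 x \times 2x^{-2}\, dx = \int_1^2 2x^{-1}\, dx \]
\[ \quad = \Big[ 2\log(x) \Big]_1^2 \]
\[ \quad = 2\log(2) - 2\log(1) \]
\[ \quad = 2\log(2). \]

For Var(X), we use

\[ Var(X) = E(X^2) - \{E(X)\}^2. \]

Now

\[ E(X^2) = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx = \int_1^2 x^2 \times 2x^{-2}\, dx = \int_1^2 2\, dx \]
\[ \quad = \Big[ 2x \Big]_1^2 \]
\[ \quad = 2 \times 2 - 2 \times 1 \]
\[ \quad = 2. \]

Thus

\[ Var(X) = E(X^2) - \{E(X)\}^2 \]
\[ \quad = 2 - \{2\log(2)\}^2 \]
\[ \quad = 0.0782. \]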
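These two integrals can be checked numerically. The following R sketch uses the built-in `integrate` function; the variable names are illustrative. Note that `log` in R is the natural logarithm, matching the working above.

```r
# p.d.f. on its support (1, 2): f(x) = 2 / x^2
f <- function(x) 2 / x^2

# E(X) and E(X^2) by numerical integration over the support
EX  <- integrate(function(x) x   * f(x), lower = 1, upper = 2)$value
EX2 <- integrate(function(x) x^2 * f(x), lower = 1, upper = 2)$value

# Var(X) = E(X^2) - {E(X)}^2
VarX <- EX2 - EX^2

print(c(EX = EX, exact_EX = 2 * log(2), EX2 = EX2, VarX = VarX))
```

The numerical values should agree with 2 log(2) for E(X), 2 for E(X²), and 0.0782 for Var(X) to the accuracy shown in the working.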


Covariance 


Covariance is a measure of the association or dependence between two random variables X and Y. Covariance can be either positive or negative. (Variance is never negative.)

Definition: Let X and Y be any random variables. The covariance between X and Y is given by

\[ cov(X, Y) = E\{ (X - \mu_X)(Y - \mu_Y) \} = E(XY) - E(X)E(Y), \]

where \( \mu_X = E(X) \), \( \mu_Y = E(Y) \).


1. cov(X, Y) will be positive if large values of X tend to occur with large values 
of Y, and small values of X tend to occur with small values of Y. For example, 
if X is height and Y is weight of a randomly selected person, we would expect 
cov(X, Y) to be positive. 
2. cov(X, Y) will be negative if large values of X tend to occur with small values
of Y, and small values of X tend to occur with large values of Y. For example,
if X is age of a randomly selected person, and Y is heart rate, we would expect
X and Y to be negatively correlated (older people have slower heart rates).


3. If X and Y are independent, then there is no pattern between large values of
X and large values of Y, so cov(X, Y) = 0. However, cov(X, Y) = 0 does NOT
imply that X and Y are independent, unless X and Y are jointly Normally distributed.


Properties of Variance 


i) Let g be a function, and let a and b be constants. For any random variable X (discrete or continuous),

\[ Var\{ a\,g(X) + b \} = a^2\, Var\{ g(X) \}. \]

In particular, \( Var(aX + b) = a^2\, Var(X) \).

ii) Let X and Y be independent random variables. Then

\[ Var(X + Y) = Var(X) + Var(Y). \]

iii) If X and Y are NOT independent, then

\[ Var(X + Y) = Var(X) + Var(Y) + 2\,cov(X, Y). \]
Correlation (non-examinable) 


The correlation coefficient of X and Y is a measure of the linear association between X and Y. It is given by the covariance, scaled by the overall variability in X and Y. As a result, the correlation coefficient is always between -1 and +1, so it is easily compared for different quantities.

Definition: The correlation between X and Y, also called the correlation coefficient, is given by

\[ corr(X, Y) = \frac{cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}. \]

The correlation measures linear association between X and Y. It takes values only between -1 and +1, and has the same sign as the covariance.

The correlation is ±1 if and only if there is a perfect linear relationship between X and Y, i.e. \( corr(X, Y) = \pm 1 \iff Y = aX + b \) for some constants \( a \ne 0 \) and b. The sign of the correlation is the sign of a.

The correlation is 0 if X and Y are independent, but a correlation of 0 does not imply that X and Y are independent.


3.3 Conditional Expectation and Conditional Variance


Throughout this section, we will assume for simplicity that X and Y are dis- 
crete random variables. However, exactly the same results hold for continuous 
random variables too. 


Suppose that X and Y are discrete random variables, possibly dependent on each other. Suppose that we fix Y at the value y. This gives us a set of conditional probabilities \( P(X = x \mid Y = y) \) for all possible values x of X. This is called the conditional distribution of X, given that Y = y.

Definition: Let X and Y be discrete random variables. The conditional probability function of X, given that Y = y, is:

\[ P(X = x \mid Y = y) = \frac{P(X = x \text{ AND } Y = y)}{P(Y = y)}. \]

We write the conditional probability function as:

\[ f_{X|Y}(x \mid y) = P(X = x \mid Y = y). \]

Note: The conditional probabilities \( f_{X|Y}(x \mid y) \) sum to one, just like any other probability function:

\[ \sum_x P(X = x \mid Y = y) = \sum_x P_{\{Y = y\}}(X = x) = 1, \]

using the subscript notation \( P_{\{Y = y\}} \) of Section 2.3.
We can also find the expectation and variance of X with respect to this conditional distribution. That is, if we know that the value of Y is fixed at y, then we can find the mean value of X given that Y takes the value y, and also the variance of X given that Y = y.

Definition: Let X and Y be discrete random variables. The conditional expectation of X, given that Y = y, is

\[ \mu_{X \mid Y = y} = E(X \mid Y = y) = \sum_x x\, f_{X|Y}(x \mid y). \]


E(X | Y =y) is the mean value of X, when Y is fixed at y. 


Conditional expectation as a random variable 


The unconditional expectation of X, E(X), is just a number: 
eg. EX =2 or EX =5.8. 


The conditional expectation, E(X | Y = y), is a number depending on y. 


If Y has an influence on the value of X, then Y will have an influence on the 
average value of X. So, for example, we would expect E(X |Y = 2) to be 
different from E(X | Y = 3). 


We can therefore view E(X | Y = y) as a function of y, say E(X | Y = y) = h(y).
To evaluate this function, h(y) = E(X |Y = y), we: 
i) fix Y at the chosen value y; 


ii) find the expectation of X when Y is fixed at this value. 




However, we could also evaluate the function at a random value of Y: 
i) observe a random value of Y ; 

ii) fix Y at that observed random value; 

iii) evaluate E(X | Y = observed random value). 

We obtain a random variable: E(X |Y) = h(Y). 


The randomness comes from the randomness in Y , not in X. 


Conditional expectation, E(X | Y), is a random variable 


with randomness inherited from Y , not X. 


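Example: Suppose

\[ Y = \begin{cases} 1 & \text{with probability } 1/8, \\ 2 & \text{with probability } 7/8, \end{cases} \qquad \text{and} \qquad X \mid Y = \begin{cases} 2Y & \text{with probability } 3/4, \\ 3Y & \text{with probability } 1/4. \end{cases} \]

Conditional expectation of X given Y = y is a number depending on y:

If Y = 1, then:

\[ X \mid (Y = 1) = \begin{cases} 2 & \text{with probability } 3/4, \\ 3 & \text{with probability } 1/4, \end{cases} \]

so \( E(X \mid Y = 1) = 2 \times \tfrac{3}{4} + 3 \times \tfrac{1}{4} = \tfrac{9}{4} \).

If Y = 2, then:

\[ X \mid (Y = 2) = \begin{cases} 4 & \text{with probability } 3/4, \\ 6 & \text{with probability } 1/4, \end{cases} \]

so \( E(X \mid Y = 2) = 4 \times \tfrac{3}{4} + 6 \times \tfrac{1}{4} = \tfrac{18}{4} \).

Thus

\[ E(X \mid Y = y) = \begin{cases} 9/4 & \text{if } y = 1, \\ 18/4 & \text{if } y = 2. \end{cases} \]

So E(X | Y = y) is a number depending on y, or a function of y.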
Conditional expectation of X given random Y is a random variable:

From above,

\[ E(X \mid Y) = \begin{cases} 9/4 & \text{if } Y = 1 \text{ (probability 1/8),} \\ 18/4 & \text{if } Y = 2 \text{ (probability 7/8).} \end{cases} \]

So

\[ E(X \mid Y) = \begin{cases} 9/4 & \text{with probability } 1/8, \\ 18/4 & \text{with probability } 7/8. \end{cases} \]

Thus E(X | Y) is a random variable.


The randomness in E(X | Y ) is inherited from Y, not from X. 


Conditional expectation is a very useful tool for finding the unconditional 
expectation of X (see below). Just like the Partition Theorem, it is useful 
because it is often easier to specify conditional probabilities than to specify 
overall probabilities. 


Conditional variance 


The conditional variance is similar to the conditional expectation.

• Var(X | Y = y) is the variance of X, when Y is fixed at the value Y = y.

• Var(X | Y) is a random variable, giving the variance of X when Y is fixed at a value to be selected randomly.

Definition: Let X and Y be random variables. The conditional variance of X, given Y, is given by

\[ Var(X \mid Y) = E(X^2 \mid Y) - \{ E(X \mid Y) \}^2 = E\{ (X - \mu_{X|Y})^2 \mid Y \}. \]

Like expectation, Var(X | Y = y) is a number depending on y (a function of y), while Var(X | Y) is a random variable with randomness inherited from Y.
Laws of Total Expectation and Variance

If all the expectations below are finite, then for ANY random variables X and Y, we have:

i) \( E(X) = E_Y\big( E(X \mid Y) \big) \)   Law of Total Expectation.

Note that we can pick any r.v. Y, to make the expectation as easy as we can.

ii) \( E(g(X)) = E_Y\big( E(g(X) \mid Y) \big) \) for any function g.

iii) \( Var(X) = E_Y\big( Var(X \mid Y) \big) + Var_Y\big( E(X \mid Y) \big) \)   Law of Total Variance.

Note: \( E_Y \) and \( Var_Y \) denote expectation over Y and variance over Y, i.e. the expectation or variance is computed over the distribution of the random variable Y.

The Law of Total Expectation says that the total average is the average of case-by-case averages.

• The total average is E(X);

• The case-by-case averages are E(X | Y) for the different values of Y;

• The average of case-by-case averages is the average over Y of the Y-case averages: \( E_Y\big( E(X \mid Y) \big) \).

Example: In the example above, we had:

\[ E(X \mid Y) = \begin{cases} 9/4 & \text{with probability } 1/8, \\ 18/4 & \text{with probability } 7/8. \end{cases} \]

The total average is:

\[ E(X) = E_Y\{ E(X \mid Y) \} = \frac{9}{4} \times \frac{1}{8} + \frac{18}{4} \times \frac{7}{8} = \frac{135}{32} \approx 4.22. \]
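Both laws can be verified exactly for this small example by listing the joint distribution of (X, Y). The R sketch below is illustrative only: the data-frame layout and variable names are assumptions. It writes X as M × Y, where the multiplier M is 2 with probability 3/4 and 3 with probability 1/4, independently of Y.

```r
# Distribution of Y, and of the multiplier M in X = M * Y
y_vals <- c(1, 2);  p_y <- c(1 / 8, 7 / 8)
m_vals <- c(2, 3);  p_m <- c(3 / 4, 1 / 4)

# Joint distribution of (Y, M), hence of (Y, X)
joint <- expand.grid(y = y_vals, m = m_vals)
joint$prob <- p_y[match(joint$y, y_vals)] * p_m[match(joint$m, m_vals)]
joint$x <- joint$m * joint$y

# Direct calculation of E(X) and Var(X) from the joint distribution
EX_direct   <- sum(joint$x * joint$prob)
VarX_direct <- sum(joint$x^2 * joint$prob) - EX_direct^2

# Conditional mean and variance of X for each value of Y
cond_mean <- y_vals * sum(m_vals * p_m)                               # 9/4, 18/4
cond_var  <- y_vals^2 * (sum(m_vals^2 * p_m) - sum(m_vals * p_m)^2)

# Law of Total Expectation and Law of Total Variance
EX_total   <- sum(cond_mean * p_y)
VarX_total <- sum(cond_var * p_y) + (sum(cond_mean^2 * p_y) - EX_total^2)

print(c(EX_direct = EX_direct, EX_total = EX_total,
        VarX_direct = VarX_direct, VarX_total = VarX_total))
```

The direct and "total" calculations should agree, with E(X) = 135/32.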


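Proof of (i), (ii), (iii):

(i) is a special case of (ii), so we just need to prove (ii). Begin at RHS:

\[ \text{RHS} = E_Y\big[ E(g(X) \mid Y) \big] = E_Y\Big[ \sum_x g(x) P(X = x \mid Y) \Big] \]
\[ \quad = \sum_y \Big[ \sum_x g(x) P(X = x \mid Y = y) \Big] P(Y = y) \]
\[ \quad = \sum_y \sum_x g(x) P(X = x \mid Y = y) P(Y = y) \]
\[ \quad = \sum_x g(x) \sum_y P(X = x \mid Y = y) P(Y = y) \]
\[ \quad = \sum_x g(x) P(X = x) \quad \text{(partition rule)} \]
\[ \quad = E(g(X)) = \text{LHS}. \]

(iii) Wish to prove \( Var(X) = E_Y[Var(X \mid Y)] + Var_Y[E(X \mid Y)] \). Begin at RHS:

\[ E_Y[Var(X \mid Y)] + Var_Y[E(X \mid Y)] \]
\[ \quad = E_Y\big\{ E(X^2 \mid Y) - (E(X \mid Y))^2 \big\} + \Big\{ E_Y\big( [E(X \mid Y)]^2 \big) - \big[ E_Y(E(X \mid Y)) \big]^2 \Big\} \]
\[ \quad = E_Y\{ E(X^2 \mid Y) \} - E_Y\{ [E(X \mid Y)]^2 \} + E_Y\{ [E(X \mid Y)]^2 \} - (EX)^2 \]

where \( E_Y(E(X \mid Y)) = E(X) \) by part (i), and \( E_Y\{E(X^2 \mid Y)\} = E(X^2) \) by part (i) applied to \( X^2 \). Hence

\[ \quad = E(X^2) - (EX)^2 \]
\[ \quad = Var(X) = \text{LHS}. \quad \Box \]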


3.4 Examples of Conditional Expectation and Variance 


1. Swimming with dolphins 


Fraser runs a dolphin-watch business. 
Every day, he is unable to run the trip 
due to bad weather with probability p, 
independently of all other days. Fraser works every day except the bad-weather 
days, which he takes as holiday. 


Let Y be the number of consecutive days Fraser has to work between bad- 
weather days. Let X be the total number of customers who go on Fraser’s trip 
in this period of Y days. Conditional on Y, the distribution of X is 


\[ (X \mid Y) \sim \text{Poisson}(\mu Y). \]


(a) Name the distribution of Y, and state E(Y) and Var(Y). 


(b) Find the expectation and the variance of the number of customers Fraser 
sees between bad-weather days, E(X) and Var(X). 


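(a) Let ‘success’ be ‘bad-weather day’ and ‘failure’ be ‘work-day’.
Then P(success) = P(bad-weather) = p.
Y is the number of failures before the first success.

So

\[ Y \sim \text{Geometric}(p). \]

Thus

\[ E(Y) = \frac{1 - p}{p}, \qquad Var(Y) = \frac{1 - p}{p^2}. \]

(b) We know \( (X \mid Y) \sim \text{Poisson}(\mu Y) \): so

\[ E(X \mid Y) = Var(X \mid Y) = \mu Y. \]

By the Law of Total Expectation:

\[ E(X) = E_Y\{ E(X \mid Y) \} = E_Y(\mu Y) = \mu E_Y(Y) \]
\[ \therefore \quad E(X) = \frac{\mu(1 - p)}{p}. \]

By the Law of Total Variance:

\[ Var(X) = E_Y\big( Var(X \mid Y) \big) + Var_Y\big( E(X \mid Y) \big) \]
\[ \quad = E_Y(\mu Y) + Var_Y(\mu Y) \]
\[ \quad = \mu E_Y(Y) + \mu^2 Var_Y(Y) \]
\[ \quad = \mu \left( \frac{1 - p}{p} \right) + \mu^2 \left( \frac{1 - p}{p^2} \right) \]
\[ \quad = \frac{\mu (1 - p)(p + \mu)}{p^2}. \]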

Checking your answer in R: 


If you know how to use a statistical package like R, you can check your answer 
to the question above as follows. 


> # Pick a value for p, e.g. p = 0.2.
> # Pick a value for mu, e.g. mu = 25
>
> # Generate 10,000 random values of Y ~ Geometric(p = 0.2):
> y <- rgeom(10000, prob=0.2)
>
> # Generate 10,000 random values of X conditional on Y:
> # use (X | Y) ~ Poisson(mu * Y) ~ Poisson(25 * Y)
> x <- rpois(10000, lambda = 25*y)
>
> # Find the sample mean of X (should be close to E(X)):
> mean(x)
[1] 100.6606
>
> # Find the sample variance of X (should be close to var(X)):
> var(x)
[1] 12624.47
>
> # Check the formula for E(X):
> 25 * (1 - 0.2) / 0.2
[1] 100
>
> # Check the formula for var(X):
> 25 * (1 - 0.2) * (0.2 + 25) / 0.2^2
[1] 12600


The formulas we obtained by working give E(X) = 100 and Var(X) = 12600.
The sample mean was x̄ = 100.6606 (close to 100), and the sample variance
was 12624.47 (close to 12600). Thus our working seems to have been correct.
2. Randomly stopped sum

This model arises very commonly in stochastic processes. A random number N of events occur, and each event i has associated with it some cost, penalty, or reward \( X_i \). The question is to find the mean and variance of the total cost / reward:

\[ T_N = X_1 + X_2 + \ldots + X_N. \]

The difficulty is that the number N of terms in the sum is itself random.

\( T_N \) is called a randomly stopped sum: it is a sum of \( X_i \)'s, randomly stopped at the random number of N terms.

Example: Think of a cash machine, which has to be loaded with enough money to cover the day's business. The number of customers per day is a random number N. Customer i withdraws a random amount \( X_i \). The total amount withdrawn during the day is a randomly stopped sum: \( T_N = X_1 + \ldots + X_N \).
Cash machine example

The citizens of Remuera withdraw money from a cash machine according to the following probability function (X):

Amount, x ($)    50     100    200
P(X = x)         0.3    0.5    0.2

The number of customers per day has the distribution N ~ Poisson(λ).

Let \( T_N = X_1 + X_2 + \ldots + X_N \) be the total amount of money withdrawn in a day, where each \( X_i \) has the probability function above, and \( X_1, X_2, \ldots \) are independent of each other and of N.

\( T_N \) is a randomly stopped sum, stopped by the random number of N customers.

(a) Show that E(X) = 105, and Var(X) = 2725.

(b) Find \( E(T_N) \) and \( Var(T_N) \): the mean and variance of the amount of money withdrawn each day.


Solution 


(a) Exercise. 


(b) Let \( T_N = \sum_{i=1}^{N} X_i \). If we knew how many terms were in the sum, we could easily find \( E(T_N) \) and \( Var(T_N) \) as the mean and variance of a sum of independent r.v.s.

So ‘pretend’ we know how many terms are in the sum: i.e. condition on N.

\[ E(T_N \mid N) = E(X_1 + X_2 + \ldots + X_N \mid N) \]
\[ \quad = E(X_1 + X_2 + \ldots + X_N) \quad \text{where N is now considered constant (because all } X_i\text{'s are independent of N)} \]
\[ \quad = E(X_1) + E(X_2) + \ldots + E(X_N) \quad \text{(we do NOT need independence of } X_i\text{'s for this)} \]
\[ \quad = N \times E(X) \quad \text{(because all } X_i\text{'s have same mean, } E(X)) \]
\[ \quad = 105N. \]

Similarly,

\[ Var(T_N \mid N) = Var(X_1 + X_2 + \ldots + X_N \mid N) \]
\[ \quad = Var(X_1 + X_2 + \ldots + X_N) \quad \text{where N is now considered constant (because all } X_i\text{'s are independent of N)} \]
\[ \quad = Var(X_1) + Var(X_2) + \ldots + Var(X_N) \quad \text{(we DO need independence of } X_i\text{'s for this)} \]
\[ \quad = N \times Var(X) \quad \text{(because all } X_i\text{'s have same variance, } Var(X)) \]
\[ \quad = 2725N. \]

So

\[ E(T_N) = E_N\{ E(T_N \mid N) \} = E_N(105N) = 105\,E_N(N) = 105\lambda, \]

because N ~ Poisson(λ) so E(N) = λ.

Similarly,

\[ Var(T_N) = E_N\{ Var(T_N \mid N) \} + Var_N\{ E(T_N \mid N) \} \]
\[ \quad = E_N\{ 2725N \} + Var_N\{ 105N \} \]
\[ \quad = 2725\,E_N(N) + 105^2\,Var_N(N) \]
\[ \quad = 2725\lambda + 11025\lambda \]
\[ \quad = 13750\lambda, \]

because N ~ Poisson(λ) so E(N) = Var(N) = λ.
Check in R (advanced)


> # Create a function tn.func to calculate a single value of T_N
> # for a given value N=n:
> tn.func <- function(n){
+     sum(sample(c(50, 100, 200), n, replace=TRUE,
+                prob=c(0.3, 0.5, 0.2)))
+ }
>
> # Generate 10,000 random values of N, using lambda=50:
> N <- rpois(10000, lambda=50)
>
> # Generate 10,000 random values of T_N, conditional on N:
> TN <- sapply(N, tn.func)
>
> # Find the sample mean of T_N values, which should be close to
> # 105 * 50 = 5250:
> mean(TN)
[1] 5253.255
>
> # Find the sample variance of T_N values, which should be close
> # to 13750 * 50 = 687500:
> var(TN)
[1] 682469.4


All seems well. Note that the sample variance is often some distance from the 
true variance, even when the sample size is 10,000. 


General result for randomly stopped sums: 


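Suppose \( X_1, X_2, \ldots \) each have the same mean μ and variance σ², and \( X_1, X_2, \ldots \), and N are mutually independent. Let \( T_N = X_1 + \ldots + X_N \) be the randomly stopped sum. By following similar working to that above:

\[ E(T_N) = E\Big\{ \sum_{i=1}^{N} X_i \Big\} = \mu\,E(N), \]

\[ Var(T_N) = Var\Big\{ \sum_{i=1}^{N} X_i \Big\} = \sigma^2 E(N) + \mu^2 Var(N). \]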
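The general result is straightforward to wrap in a small helper and apply to the cash machine example. This is a sketch only: the function name `stopped_sum_moments` and its argument names are hypothetical, and λ = 50 is the value used in the R check above.

```r
# Mean and variance of a randomly stopped sum T_N = X_1 + ... + X_N,
# given mu = E(X), sigma2 = Var(X), EN = E(N), VarN = Var(N).
# Assumes X_1, X_2, ... and N are mutually independent.
# (stopped_sum_moments is a hypothetical helper name.)
stopped_sum_moments <- function(mu, sigma2, EN, VarN) {
  c(mean = mu * EN,
    var  = sigma2 * EN + mu^2 * VarN)
}

# Cash machine example: withdrawal amounts and their probabilities
x  <- c(50, 100, 200)
px <- c(0.3, 0.5, 0.2)
mu     <- sum(x * px)              # E(X)
sigma2 <- sum(x^2 * px) - mu^2     # Var(X)

# N ~ Poisson(lambda), so E(N) = Var(N) = lambda
lambda <- 50
res <- stopped_sum_moments(mu, sigma2, EN = lambda, VarN = lambda)
print(c(mu = mu, sigma2 = sigma2))
print(res)
```

With λ = 50 the helper should reproduce the targets 105 × 50 = 5250 and 13750 × 50 = 687500 used in the simulation check.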
3.5 First-Step Analysis for calculating expected reaching times


Remember from Section 2.6 that we use First-Step Analysis for finding the 
probability of eventually reaching a particular state in a stochastic process. 
First-step analysis for probabilities uses conditional probability and the Partition 
Theorem (Law of Total Probability). 


In the same way, we can use first-step analysis for finding the expected reaching 
time for a state. 


This is the expected number of steps that will be needed to reach a particular 
state from a specified start-point, or the expected length of time it will take to 
get there if we have a continuous time process. 


Just as first-step analysis for probabilities uses conditional probability and the 
law of total probability (Partition Theorem), first-step analysis for expectations 
uses conditional expectation and the law of total expectation. 


First-step analysis for probabilities: 


The first-step analysis procedure for probabilities can be summarized as follows: 


\[ P(\text{eventual goal}) = \sum_{\text{first-step options}} P(\text{eventual goal} \mid \text{option})\, P(\text{option}). \]

This is because the first-step options form a partition of the sample space.


First-step analysis for expected reaching times:

The expression for expected reaching times is very similar:

\[ E(\text{reaching time}) = \sum_{\text{first-step options}} E(\text{reaching time} \mid \text{option})\, P(\text{option}). \]

This follows immediately from the law of total expectation:

\[ E(X) = E_Y\{ E(X \mid Y) \} = \sum_y E(X \mid Y = y)\, P(Y = y). \]

Let X be the reaching time, and let Y be the label for possible options: i.e. Y = 1, 2, 3, ... for options 1, 2, 3, ....

We then obtain:

\[ E(X) = \sum_y E(X \mid Y = y)\, P(Y = y), \]

i.e.

\[ E(\text{reaching time}) = \sum_{\text{first-step options}} E(\text{reaching time} \mid \text{option})\, P(\text{option}). \]

Example 1: Mouse in a Maze 
A mouse is trapped in a room with three exits at 
the centre of a maze. 
e Exit 1 leads outside the maze after 3 minutes. 
e Exit 2 leads back to the room after 5 minutes. 


e Exit 3 leads back to the room after 7 minutes. 


64 


> IE(reaching time | option)P( option) . 


Every time the mouse makes a choice, it is equally likely to choose any of the 
three exits. What is the expected time taken for the mouse to leave the maze? 


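[Figure: transition diagram for the maze. From the Room, Exit 1 (probability 1/3,
3 mins) leads out of the maze; Exit 2 (probability 1/3, 5 mins) and Exit 3
(probability 1/3, 7 mins) lead back to the Room.]

Let X = time taken for mouse to leave maze, starting from room R.

Let Y = exit the mouse chooses first (1, 2, or 3).

Then:

    E(X) = E_Y( E(X | Y) )
         = \sum_{y=1}^{3} E(X | Y = y) P(Y = y)
         = E(X | Y = 1) \times \frac{1}{3} + E(X | Y = 2) \times \frac{1}{3} + E(X | Y = 3) \times \frac{1}{3}.


But:

    E(X | Y = 1) = 3 minutes
    E(X | Y = 2) = 5 + E(X)   (after 5 mins back in Room, time E(X) to get out)
    E(X | Y = 3) = 7 + E(X)   (after 7 mins, back in Room)

So

    E(X) = 3 \times \frac{1}{3} + (5 + E(X)) \times \frac{1}{3} + (7 + E(X)) \times \frac{1}{3}
         = 15 \times \frac{1}{3} + 2 E(X) \times \frac{1}{3}

    => \frac{1}{3} E(X) = 15 \times \frac{1}{3}

    => E(X) = 15 minutes.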
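
The first-step answer can be checked by simulation. The sketch below is only an illustration, not part of the course method: the function name, the seed, and the number of runs are arbitrary choices. It lets many independent mice choose exits at random and averages their escape times.

```python
import random

def escape_time(rng):
    """Simulate one mouse: total minutes until it leaves via Exit 1."""
    minutes = 0
    while True:
        exit_chosen = rng.choice([1, 2, 3])  # each exit has probability 1/3
        if exit_chosen == 1:
            return minutes + 3   # Exit 1: outside after 3 minutes
        elif exit_chosen == 2:
            minutes += 5         # Exit 2: back in the room after 5 minutes
        else:
            minutes += 7         # Exit 3: back in the room after 7 minutes

rng = random.Random(325)         # fixed seed so the run is repeatable
n_runs = 200_000
average = sum(escape_time(rng) for _ in range(n_runs)) / n_runs
print(f"simulated E(X) is about {average:.2f} minutes (first-step analysis gives 15)")
```

The simulated average should settle close to 15 minutes as the number of runs grows.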


Notation for quick solutions of first-step analysis problems 


As for probabilities, first-step analysis for expectations relies on a good notation. 
The best way to tackle the problem above is as follows. 


Define m_R = E(time to leave maze | start in Room).

First-step analysis:

    m_R = \frac{1}{3} \times 3 + \frac{1}{3} \times (5 + m_R) + \frac{1}{3} \times (7 + m_R)

    => 3 m_R = (3 + 5 + 7) + 2 m_R

    => m_R = 15 minutes (as before).


Example 2: Counting the steps


The most common questions involving first-step analysis for expectations ask 
for the expected number of steps before finishing. The number of steps 
is usually equal to the number of arrows traversed from the current state to the 
end. 


The key point to remember is that when we take expectations, we are usually 
counting something. 


You must remember to add on whatever we are counting, to every step taken. 


The mouse is put in a new maze with two rooms, pictured here. Starting from
Room 1, what is the expected number of steps the mouse takes before it reaches
the exit?

[Figure: transition diagram with states Room 1, Room 2, and EXIT. From each
room the mouse stays in that room, moves to the other room, or moves to EXIT,
each with probability 1/3.]


1. Define notation: let

    m_1 = E(number of steps to finish | start in Room 1)
    m_2 = E(number of steps to finish | start in Room 2).

2. First-step analysis:

    m_1 = \frac{1}{3} \times 1 + \frac{1}{3} (1 + m_1) + \frac{1}{3} (1 + m_2)      (a)

    m_2 = \frac{1}{3} \times 1 + \frac{1}{3} (1 + m_1) + \frac{1}{3} (1 + m_2)      (b)


We could solve as simultaneous equations, as usual, but in this case inspection of
(a) and (b) shows immediately that m_1 = m_2. Thus:

    (a) => 3 m_1 = 3 + 2 m_1

        => m_1 = 3 steps.

Further, m_2 = m_1 = 3 steps also.
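
When inspection does not give a shortcut, the first-step equations can be solved directly as a linear system. The sketch below is an illustration only: it assumes the two-room transition probabilities described above (each of the three moves has probability 1/3), and the variable names are arbitrary. It uses exact fractions and Cramer's rule for the 2x2 system.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Rearranged first-step equations, written as a linear system in (m1, m2):
#   (1 - 1/3) m1 -      1/3  m2 = 1      (from equation (a))
#       -1/3  m1 + (1 - 1/3) m2 = 1      (from equation (b))
a11, a12, b1 = 1 - third, -third, Fraction(1)
a21, a22, b2 = -third, 1 - third, Fraction(1)

# Cramer's rule for a 2x2 system.
det = a11 * a22 - a12 * a21
m1 = (b1 * a22 - a12 * b2) / det
m2 = (a11 * b2 - a21 * b1) / det
print(m1, m2)  # prints "3 3"
```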
Incrementing before partitioning


In many problems, all possible first-step options incur the same initial penalty.
The last example is such a case, because every possible step adds 1 to the
total number of steps taken.

[Figure: the two-room transition diagram from Example 2, repeated.]


In a case where all steps incur the same penalty,
there are two ways of proceeding:


1. Add the penalty onto each option separately: e.g.

    m_1 = \frac{1}{3} \times 1 + \frac{1}{3} (1 + m_1) + \frac{1}{3} (1 + m_2).


2. (Usually quicker) Add the penalty once only, at the beginning:

    m_1 = 1 + \frac{1}{3} \times 0 + \frac{1}{3} m_1 + \frac{1}{3} m_2.


In each case, we will get the same answer (check). This is because the option
probabilities sum to 1, so in Method 1 we are adding
(\frac{1}{3} + \frac{1}{3} + \frac{1}{3}) \times 1 = 1 \times 1 = 1,
just as we are in Method 2.


3.6 Probability as a conditional expectation

Recall from Section 3.1 that for any event A, we can write P(A) as an
expectation as follows.


Define the indicator random variable:

    I_A = 1 if event A occurs,
          0 otherwise.


Then E(I_A) = P(I_A = 1) = P(A).


We can refine this expression further, using the idea of conditional expectation.
Let Y be any random variable. Then

    P(A) = E(I_A) = E_Y( E(I_A | Y) ).

But

    E(I_A | Y) = \sum_{r=0}^{1} r P(I_A = r | Y)
               = 0 \times P(I_A = 0 | Y) + 1 \times P(I_A = 1 | Y)
               = P(I_A = 1 | Y)
               = P(A | Y).

Thus

    P(A) = E_Y( E(I_A | Y) ) = E_Y( P(A | Y) ).


This means that for any random variable X (discrete or continuous), and for 
any set of values S (a discrete set or a continuous set), we can write: 


• for any discrete random variable Y,

    P(X \in S) = \sum_y P(X \in S | Y = y) P(Y = y).


• for any continuous random variable Y,

    P(X \in S) = \int_y P(X \in S | Y = y) f_Y(y) dy.


Example of probability as a conditional expectation: winning a lottery 


Suppose that a million people have bought tickets for the 
weekly lottery draw. Each person has a probability of one- 
in-a-million of selecting the winning numbers. If more than 
one person selects the winning numbers, the winner will be 
chosen at random from all those with matching numbers. 


You watch the lottery draw on TV and your numbers match the winners!! You 
had a one-in-a-million chance, and there were a million players, so it must be 
YOU, right? 


Not so fast. Before you rush to claim your prize, let’s calculate the probability 
that you really will win. You definitely win if you are the only person with 
matching numbers, but you can also win if there are multiple matching
tickets and yours is the one selected at random from the matches. 


Define Y to be the number of OTHER matching tickets out of the OTHER 1 
million tickets sold. (If you are lucky, Y = 0 so you have definitely won.) 


If there are 1 million tickets and each ticket has a one-in-a-million chance of 
having the winning numbers, then 
Y ~ Poisson(1) approximately. 


The relationship Y ~ Poisson(1) arises because of the Poisson approximation 
to the Binomial distribution. 


(a) What is the probability function of Y, fy(y)? 


    f_Y(y) = P(Y = y) = \frac{1^y}{y!} e^{-1} = \frac{1}{e \times y!}   for y = 0, 1, 2, ....


(b) What is the probability that yours is the only matching ticket?

    P(only one matching ticket) = P(Y = 0) = \frac{1}{e} = 0.368.


(c) The prize is chosen at random from all those who have matching tickets.
What is the probability that you win if there are Y = y OTHER matching
tickets?


Let W be the event that I win.

    P(W | Y = y) = \frac{1}{y + 1}.


(d) Overall, what is the probability that you win, given that you have a
matching ticket?

    P(W) = E_Y{ P(W | Y = y) }
         = \sum_{y=0}^{\infty} \frac{1}{y + 1} \times \frac{1}{e \times y!}
         = \frac{1}{e} \sum_{y=0}^{\infty} \frac{1}{(y + 1)!}
         = \frac{1}{e} (e - 1)
         = 1 - \frac{1}{e}
         = 0.632.


Disappointing?
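
Part (d) can also be evaluated numerically. The sketch below is an illustration only: the function name and the truncation point are arbitrary choices. It sums P(W | Y = y) P(Y = y) over y and compares the result with 1 - 1/e, the known value of this series.

```python
import math

def p_other_matches(y):
    """P(Y = y) for Y ~ Poisson(1): the number of OTHER matching tickets."""
    return math.exp(-1) / math.factorial(y)

# P(W) = sum over y of P(W | Y = y) P(Y = y), with P(W | Y = y) = 1 / (y + 1).
# Terms beyond y = 50 are far below floating-point precision, so truncate there.
p_win = sum(p_other_matches(y) / (y + 1) for y in range(51))

print(round(p_win, 3), round(1 - math.exp(-1), 3))  # prints "0.632 0.632"
```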


3.7 Special process: a model for gene spread


Suppose that a particular gene comes in two variants (alleles): A and B. We 
might be interested in the case where one of the alleles, say A, is harmful — 
for example it causes a disease. All animals in the population must have either 
allele A or allele B. We want to know how long it will take before all animals 
have the same allele, and whether this allele will be the harmful allele A or 
the safe allele B. This simple model assumes asexual reproduction. It is very 
similar to the famous Wright-Fisher model, which is a fundamental model of 
population genetics. 


Assumptions: 
1. The population stays at constant size N for all generations. 


2. At the end of each generation, the N animals create N offspring and then 
they immediately die. 


3. If there are x parents with allele A, and N - x with allele B, then each
offspring gets allele A with probability x/N and allele B with probability 1 - x/N.


4. All offspring are independent. 


Stochastic process: 


The state of the process at time t is X_t = the number of animals with allele A
at generation t.


Each X_t could be 0, 1, 2, ..., N. The state space is {0, 1, 2, ..., N}.


Distribution of [X_{t+1} | X_t]


Suppose that X_t = x, so x of the animals at generation t have allele A.


Each of the N offspring will get A with probability x/N and B with probability
1 - x/N.


Thus the number of offspring at time t+1 with allele A is: X_{t+1} ~ Binomial(N, x/N).


We write this as follows:

    [X_{t+1} | X_t = x] ~ Binomial(N, \frac{x}{N}).

If
    [X_{t+1} | X_t = x] ~ Binomial(N, \frac{x}{N}),
then

    P(X_{t+1} = y | X_t = x) = \binom{N}{y} (\frac{x}{N})^y (1 - \frac{x}{N})^{N-y}      (Binomial formula)


Example with N = 3 


This process becomes complicated to do by hand when N is large. We can use
small N to see how to use first-step analysis to answer our questions. 


Transition diagram: 


Exercise: find the missing probabilities a, b, c, and d when N = 3. Express 
them all as fractions over the same denominator. 
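
One way to check your answers to this exercise is to tabulate the Binomial transition probabilities by machine. The sketch below is an illustration only (the function name is hypothetical): it applies the Binomial formula above with exact fractions and prints every row of the transition matrix over the common denominator N^N.

```python
from fractions import Fraction
from math import comb

def transition_row(x, N):
    """P(X_{t+1} = y | X_t = x) for y = 0..N, using the Binomial(N, x/N) formula."""
    p = Fraction(x, N)
    return [comb(N, y) * p**y * (1 - p)**(N - y) for y in range(N + 1)]

N = 3
for x in range(N + 1):
    # Print each probability over the common denominator N**N = 27.
    row = transition_row(x, N)
    print(x, [f"{int(prob * N**N)}/{N**N}" for prob in row])
```

Each row sums to 1, and the rows for x = 0 and x = N put all their probability on staying put, which is why those two states are absorbing.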


Probability the harmful allele A dies out 


Suppose the process starts at generation 0. One of the three animals has the 
harmful allele A. Define a suitable notation, and find the probability that the 
harmful allele A eventually dies out. 


Exercise: answer = 2/3.


Expected number of generations to fixation


Suppose again that the process starts at generation 0, and one of the three 
animals has the harmful allele A. Eventually all animals will have the same 
allele, whether it is allele A or B. When this happens, the population is said to 
have reached fixation: it is fixed for a single allele and no further changes are 
possible. 


Define a suitable notation, and find the expected number of generations to 
fixation. 


Exercise: answer = 3 generations on average. 


Things get more interesting for large N. When N = 100, and x = 10 animals 
have the harmful allele at generation 0, there is a 90% chance that the harmful 
allele will die out and a 10% chance that the harmful allele will take over the 
whole population. The expected number of generations taken to reach fixation 
is 63.5. If the process starts with just x = 1 animal with the harmful allele,
there is a 99% chance the harmful allele will die out, but the expected number of 
generations to fixation is 10.5. Despite the allele being rare, the average number 
of generations for it to either die out or saturate the population is quite large. 
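
For larger N the first-step equations are best solved by computer. The sketch below is one possible approach, not necessarily how the figures quoted above were obtained: it builds the first-step equations for the non-absorbing states 1, ..., N-1 and solves them by plain Gaussian elimination in floating point. The helper names `solve` and `gene_model` are hypothetical.

```python
from math import comb

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A is n x n)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gene_model(N):
    """Return (h, m): h[x] = P(allele A dies out | X_0 = x) and
    m[x] = E(generations to fixation | X_0 = x), for x = 0..N."""
    def p(x, y):  # Binomial(N, x/N) transition probability
        return comb(N, y) * (x / N) ** y * (1 - x / N) ** (N - y)
    inner = range(1, N)   # non-absorbing states 1..N-1
    # First-step equations: h_x = p(x,0) + sum_y p(x,y) h_y,  m_x = 1 + sum_y p(x,y) m_y
    A = [[(1.0 if x == y else 0.0) - p(x, y) for y in inner] for x in inner]
    h = solve(A, [p(x, 0) for x in inner])
    m = solve(A, [1.0 for _ in inner])
    return [1.0] + h + [0.0], [0.0] + m + [0.0]

h3, m3 = gene_model(3)
h100, m100 = gene_model(100)
print(h3[1], m3[1])          # compare with the exercise answers 2/3 and 3
print(h100[10], m100[10])    # compare with the quoted 90% and 63.5
print(h100[1], m100[1])      # compare with the quoted 99% and 10.5
```

Adding the penalty of 1 once at the start of each equation for m_x is the "incrementing before partitioning" shortcut from Section 3.5.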


Note: The model above is also an example of a process called the Voter Process. 
The N individuals correspond to N people who each support one of two political 
candidates, A or B. Every day they make a new decision about whom to support, 
based on the amount of current support for each candidate. Fixation in the 
genetic model corresponds to consensus in the Voter Process.


Chapter 4: Generating Functions


This chapter looks at Probability Generating Functions (PGFs) for discrete
random variables. PGFs are useful tools for dealing with sums and limits of
random variables. For some stochastic processes, they also have a special role
in telling us whether a process will ever reach a particular state.


By the end of this chapter, you should be able to:
• find the sum of Geometric, Binomial, and Exponential series;
• know the definition of the PGF, and use it to calculate the mean, variance,
  and probabilities;
• calculate the PGF for Geometric, Binomial, and Poisson distributions;
• calculate the PGF for a randomly stopped sum;
• calculate the PGF for first reaching times in the random walk;
• use the PGF to determine whether a process will ever reach a given state.


4.1 Common sums


1. Geometric Series

    1 + r + r^2 + r^3 + ... = \sum_{x=0}^{\infty} r^x = \frac{1}{1 - r},   when |r| < 1.


This formula proves that \sum_{x=0}^{\infty} P(X = x) = 1 when X ~ Geometric(p):

    P(X = x) = p (1 - p)^x   =>   \sum_{x=0}^{\infty} P(X = x) = \sum_{x=0}^{\infty} p (1 - p)^x
                                                               = p \sum_{x=0}^{\infty} (1 - p)^x
                                                               = \frac{p}{1 - (1 - p)}   (because |1 - p| < 1)
                                                               = 1.


2. Binomial Theorem

For any p, q \in R, and integer n,

    (p + q)^n = \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x}.


Note that \binom{n}{x} = \frac{n!}{(n - x)! x!}   (the nCr button on a calculator).


The Binomial Theorem proves that \sum_{x=0}^{n} P(X = x) = 1 when X ~ Binomial(n, p):

    P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}   for x = 0, 1, ..., n, so

    \sum_{x=0}^{n} P(X = x) = \sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n-x}
                            = (p + (1 - p))^n
                            = 1^n
                            = 1.

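3. Exponential Power Series

For any \lambda \in R,   \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{\lambda}.

This proves that \sum_{x=0}^{\infty} P(X = x) = 1 when X ~ Poisson(\lambda):

    P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda}   for x = 0, 1, 2, ..., so

    \sum_{x=0}^{\infty} P(X = x) = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} e^{-\lambda}
                                 = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}
                                 = e^{-\lambda} e^{\lambda}
                                 = 1.


Note: Another useful identity is:   e^{\lambda} = \lim_{n \to \infty} (1 + \frac{\lambda}{n})^n   for \lambda \in R.


4.2 Probability Generating Functions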


The probability generating function (PGF) is a useful tool for dealing 
with discrete random variables taking values 0,1,2,.... Its particular strength 
is that it gives us an easy way of characterizing the distribution of X + Y when 
X and Y are independent. In general it is difficult to find the distribution of 
a sum using the traditional probability function. The PGF transforms a sum 
into a product and enables it to be handled much more easily. 


Sums of random variables are particularly important in the study of stochastic 
processes, because many stochastic processes are formed from the sum of a 
sequence of repeating steps: for example, the Gambler’s Ruin from Section 2.7. 


The name probability generating function also gives us another clue to the role 
of the PGF. The PGF can be used to generate all the probabilities of the 
distribution. This is generally tedious and is not often an efficient way of 
calculating probabilities. However, the fact that it can be done demonstrates 
that the PGF tells us everything there is to know about the distribution.


Definition: Let X be a discrete random variable taking values in the non-negative 
integers {0,1,2,...}. The probability generating function (PGF) of X is 


    G_X(s) = E(s^X),   for all s \in R for which the sum converges.


Calculating the probability generating function

    G_X(s) = E(s^X) = \sum_{x=0}^{\infty} s^x P(X = x).


Properties of the PGF:

1. G_X(0) = P(X = 0):

    G_X(0) = 0^0 \times P(X = 0) + 0^1 \times P(X = 1) + 0^2 \times P(X = 2) + ...
    => G_X(0) = P(X = 0).

2. G_X(1) = 1:

    G_X(1) = \sum_{x=0}^{\infty} 1^x P(X = x) = \sum_{x=0}^{\infty} P(X = x) = 1.


Example 1: Binomial Distribution 


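Let X ~ Binomial(n, p), so P(X = x) = \binom{n}{x} p^x q^{n-x} for x = 0, 1, ..., n,
where q = 1 - p.

    G_X(s) = \sum_{x=0}^{n} s^x \binom{n}{x} p^x q^{n-x}
           = \sum_{x=0}^{n} \binom{n}{x} (ps)^x q^{n-x}
           = (ps + q)^n   by the Binomial Theorem: true for all s.


Thus G_X(s) = (ps + q)^n for all s \in R.

[Figure: plot of G(s) against s for X ~ Bin(n = 4, p = 0.2), for s from -20 to 10.]

Check G_X(0):

    G_X(0) = (p \times 0 + q)^n
           = q^n
           = P(X = 0).

Check G_X(1):

    G_X(1) = (p \times 1 + q)^n
           = (1)^n
           = 1.


Example 2: Poisson Distribution

Let X ~ Poisson(\lambda), so P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda} for x = 0, 1, 2, ....

    G_X(s) = \sum_{x=0}^{\infty} s^x \frac{\lambda^x}{x!} e^{-\lambda} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda s)^x}{x!}
           = e^{-\lambda} e^{\lambda s}
           = e^{\lambda (s - 1)}   for all s \in R.

Thus G_X(s) = e^{\lambda (s - 1)} for all s \in R.

[Figure: plot of G(s) against s for X ~ Poisson(4), for s from -1.0 to 2.0.]
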
Example 3: Geometric Distribution

Let X ~ Geometric(p), so P(X = x) = p (1 - p)^x = p q^x for x = 0, 1, 2, ...,
where q = 1 - p.

    G_X(s) = \sum_{x=0}^{\infty} s^x p q^x
           = p \sum_{x=0}^{\infty} (qs)^x
           = \frac{p}{1 - qs}   for all s such that |qs| < 1.

Thus G_X(s) = \frac{p}{1 - qs} for |s| < \frac{1}{q}.

[Figure: plot of G(s) against s for X ~ Geom(0.8), for s from -5 to 5; G(s) goes
to infinity as s approaches 5.]
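
The three closed forms above can be checked against the defining sum G_X(s) = \sum_x s^x P(X = x). The sketch below is an illustration only: the parameter values, the evaluation point s, and the truncation lengths are arbitrary choices, and `pgf_from_pmf` is a hypothetical helper.

```python
from math import comb, exp, factorial

def pgf_from_pmf(pmf, s, terms):
    """Approximate G_X(s) = sum_x s^x P(X = x) using the first `terms` terms."""
    return sum(s**x * pmf(x) for x in range(terms))

n, p, lam = 4, 0.2, 4.0          # Binomial(4, 0.2) and Poisson(4), as in the figures
q = 1 - p
p_geo = 0.8                      # Geometric(0.8), as in the figure
q_geo = 1 - p_geo
s = 0.7                          # any s inside the region of convergence

binomial = pgf_from_pmf(lambda x: comb(n, x) * p**x * q**(n - x), s, n + 1)
poisson = pgf_from_pmf(lambda x: lam**x / factorial(x) * exp(-lam), s, 100)
geometric = pgf_from_pmf(lambda x: p_geo * q_geo**x, s, 200)

print(binomial, (p * s + q) ** n)          # series versus (ps + q)^n
print(poisson, exp(lam * (s - 1)))         # series versus e^{lambda (s - 1)}
print(geometric, p_geo / (1 - q_geo * s))  # series versus p / (1 - qs)
```

Each printed pair should agree to within floating-point rounding.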


4.3 Using the probability generating function to calculate probabilities


The probability generating function gets its name because the power series can 
be expanded and differentiated to reveal the individual probabilities. Thus, 
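given only the PGF G_X(s) = E(s^X), we can recover all probabilities P(X = x).


For shorthand, write p_x = P(X = x). Then

    G_X(s) = E(s^X) = \sum_{x=0}^{\infty} p_x s^x = p_0 + p_1 s + p_2 s^2 + p_3 s^3 + p_4 s^4 + ...


Thus p_0 = P(X = 0) = G_X(0).


First derivative: G_X'(s) = p_1 + 2 p_2 s + 3 p_3 s^2 + 4 p_4 s^3 + ...


Thus p_1 = P(X = 1) = G_X'(0).


Second derivative: G_X''(s) = 2 p_2 + (3 \times 2) p_3 s + (4 \times 3) p_4 s^2 + ...


Thus p_2 = P(X = 2) = \frac{1}{2} G_X''(0).


Third derivative: G_X'''(s) = (3 \times 2 \times 1) p_3 + (4 \times 3 \times 2) p_4 s + ...


Thus p_3 = P(X = 3) = \frac{1}{3!} G_X'''(0).


In general:

    p_n = P(X = n) = \frac{1}{n!} G_X^{(n)}(0).


Example: Let X be a discrete random variable with PGF G_X(s) = \frac{s}{5} (2 + 3 s^2).
Find the distribution of X.

    G_X(s) = \frac{2}{5} s + \frac{3}{5} s^3:        G_X(0) = P(X = 0) = 0.
    G_X'(s) = \frac{2}{5} + \frac{9}{5} s^2:         G_X'(0) = P(X = 1) = \frac{2}{5}.
    G_X''(s) = \frac{18}{5} s:                        \frac{1}{2} G_X''(0) = P(X = 2) = 0.
    G_X'''(s) = \frac{18}{5}:                         \frac{1}{3!} G_X'''(0) = P(X = 3) = \frac{3}{5}.
    G_X^{(r)}(s) = 0 for all r \geq 4:                \frac{1}{r!} G_X^{(r)}(0) = P(X = r) = 0 for all r \geq 4.


Thus
    X = 1 with probability 2/5,
        3 with probability 3/5.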
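
The same calculation can be sketched in code for any PGF that is a polynomial. The example below is an illustration only, and the function names are hypothetical: a polynomial is stored as its list of coefficients, differentiated repeatedly, and each derivative is evaluated at 0 (its constant term) and divided by n!.

```python
from fractions import Fraction
from math import factorial

def derivative(coeffs):
    """Differentiate a polynomial given as [c0, c1, c2, ...] (c_k multiplies s^k)."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def probabilities_from_pgf(coeffs):
    """Return [P(X = 0), P(X = 1), ...] using p_n = G^(n)(0) / n!."""
    probs = []
    current = list(coeffs)
    n = 0
    while current:
        probs.append(current[0] / factorial(n))   # G^(n)(0) is the constant term
        current = derivative(current)
        n += 1
    return probs

# G_X(s) = (s/5)(2 + 3 s^2) = (2/5) s + (3/5) s^3
G = [Fraction(0), Fraction(2, 5), Fraction(0), Fraction(3, 5)]
probs = probabilities_from_pgf(G)
print([str(pr) for pr in probs])  # prints ['0', '2/5', '0', '3/5']
```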


Uniqueness of the PGF 


The formula p_n = P(X = n) = \frac{1}{n!} G_X^{(n)}(0) shows that the whole sequence of
probabilities p_0, p_1, p_2, ... is determined by the values of the PGF and its
derivatives at s = 0. It follows that the PGF specifies a unique set of probabilities.


Fact: If two power series agree on any interval containing 0, however small, then
all terms of the two series are equal.


Formally: let A(s) and B(s) be PGFs with A(s) = \sum_{n=0}^{\infty} a_n s^n, B(s) = \sum_{n=0}^{\infty} b_n s^n.
If there exists some R' > 0 such that A(s) = B(s) for all -R' < s < R', then
a_n = b_n for all n.


Practical use: If we can show that two random variables have the same PGF in 
some interval containing 0, then we have shown that the two random variables 
have the same distribution. 


Another way of expressing this is to say that the PGF of X tells us everything 
there is to know about the distribution of X. 


4.4 Expectation and moments from the PGF 


As well as calculating probabilities, we can also use the PGF to calculate the 
moments of the distribution of X. The moments of a distribution are the mean, 
variance, etc. 


Theorem 4.4: Let X be a discrete random variable with PGF Gx(s). Then: 


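1. E(X) = G_X'(1).


2. E{ X(X - 1)(X - 2)...(X - k + 1) } = G_X^{(k)}(1) = \frac{d^k G_X(s)}{ds^k} |_{s=1}.
   (This is the kth factorial moment of X.)


Proof: (Sketch: see Section 4.8 for more details)

1. G_X(s) = \sum_{x=0}^{\infty} s^x p_x,

   so G_X'(s) = \sum_{x=0}^{\infty} x s^{x-1} p_x

   => G_X'(1) = \sum_{x=0}^{\infty} x p_x = E(X).

2. G_X^{(k)}(s) = \frac{d^k G_X(s)}{ds^k} = \sum_{x=k}^{\infty} x(x - 1)(x - 2)...(x - k + 1) s^{x-k} p_x

   so G_X^{(k)}(1) = \sum_{x=k}^{\infty} x(x - 1)(x - 2)...(x - k + 1) p_x
                   = E{ X(X - 1)(X - 2)...(X - k + 1) }.

[Figure: plot of G(s) against s for X ~ Poisson(4), for s from 0.0 to 1.5.]


Example: Let X ~ Poisson(\lambda). The PGF of X is G_X(s) = e^{\lambda (s - 1)}. Find E(X)
and Var(X).


Solution:

    G_X'(s) = \lambda e^{\lambda (s - 1)}

    => E(X) = G_X'(1) = \lambda.


For the variance, consider

    E{ X(X - 1) } = G_X''(1) = \lambda^2 e^{\lambda (s - 1)} |_{s=1} = \lambda^2.

So
    Var(X) = E(X^2) - (EX)^2
           = E{ X(X - 1) } + EX - (EX)^2
           = \lambda^2 + \lambda - \lambda^2
           = \lambda.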
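
The derivative formulas can be checked numerically without doing any calculus by hand. The sketch below is an illustration only: the rate and the step size h are arbitrary choices, and central finite differences stand in for exact differentiation. It approximates G_X'(1) and G_X''(1) for the Poisson PGF, which converges for all s, so evaluating it slightly above s = 1 is safe.

```python
from math import exp

lam = 4.0                                  # illustrative rate

def G(s):
    """PGF of the Poisson(lam) distribution."""
    return exp(lam * (s - 1))

h = 1e-4                                   # step for central finite differences
G1 = (G(1 + h) - G(1 - h)) / (2 * h)              # approximates G'(1) = E(X)
G2 = (G(1 + h) - 2 * G(1) + G(1 - h)) / h**2      # approximates G''(1) = E{X(X-1)}

mean = G1
variance = G2 + G1 - G1**2                 # Var(X) = E{X(X-1)} + EX - (EX)^2
print(mean, variance)                      # both should be close to lam
```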


4.5 Probability generating function for a sum of independent r.v.s 
One of the PGF’s greatest strengths is that it turns a sum into a product:

    E(s^{(X_1 + X_2)}) = E(s^{X_1} s^{X_2}).


This makes the PGF useful for finding the probabilities and moments of a sum 
of independent random variables. 


Theorem 4.5: Suppose that X_1, ..., X_n are independent random variables, and
let Y = X_1 + ... + X_n. Then

    G_Y(s) = \prod_{i=1}^{n} G_{X_i}(s).

Proof:

    G_Y(s) = E(s^{(X_1 + ... + X_n)})
           = E(s^{X_1} s^{X_2} ... s^{X_n})
           = E(s^{X_1}) E(s^{X_2}) ... E(s^{X_n})
                 (because X_1, ..., X_n are independent)
           = \prod_{i=1}^{n} G_{X_i}(s),   as required. □
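
Theorem 4.5 can be illustrated numerically. The sketch below is an illustration under stated assumptions: two independent Poisson variables with arbitrarily chosen rates, truncated at a point K beyond which the probabilities are negligible. It builds the distribution of X + Y by direct convolution, then compares its PGF with the product G_X(s) G_Y(s).

```python
from math import exp, factorial

def poisson_pmf(rate, k):
    """P(Z = k) for Z ~ Poisson(rate)."""
    return rate**k / factorial(k) * exp(-rate)

lam, mu = 2.0, 3.0        # illustrative rates for X and Y
K = 60                    # truncation point; the mass beyond K is negligible here

# Distribution of X + Y by direct convolution: P(X + Y = k) = sum_j P(X = j) P(Y = k - j)
conv = [sum(poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))
        for k in range(K)]

s = 0.6
G_sum = sum(s**k * conv[k] for k in range(K))                 # PGF of X + Y
G_X = sum(s**k * poisson_pmf(lam, k) for k in range(K))
G_Y = sum(s**k * poisson_pmf(mu, k) for k in range(K))
print(G_sum, G_X * G_Y)   # the two values should agree to rounding error

# The convolution also matches the Poisson(lam + mu) probabilities.
max_gap = max(abs(conv[k] - poisson_pmf(lam + mu, k)) for k in range(K))
print(max_gap)
```

The convolution takes a double loop, while the PGF route needs only one multiplication: this is the sense in which the PGF turns a sum into a product.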


Example: Suppose that X and Y are independent with X ~ Poisson(\lambda) and
Y ~ Poisson(\mu). Find the distribution of X + Y.

Solution:

    G_{X+Y}(s) = G_X(s) \cdot G_Y(s)
               = e^{\lambda (s - 1)} e^{\mu (s - 1)}
               = e^{(\lambda + \mu)(s - 1)}.


But this is the PGF of the Poisson(\lambda + \mu) distribution. So, by the uniqueness of
PGFs, X + Y ~ Poisson(\lambda + \mu).


4.6 Randomly stopped sum


Remember the randomly stopped sum model from Section 3.4. A random number N
of events occur, and each event i has associated with it a cost or reward X_i.
The question is to find the distribution of the total cost or reward:
T_N = X_1 + X_2 + ... + X_N.

T_N is called a randomly stopped sum because it has a random number of terms.


Example: Cash machine model. N customers arrive during the day. Customer i
withdraws amount X_i. The total amount withdrawn during the day is
T_N = X_1 + ... + X_N.


In Chapter 3, we used the Laws of Total Expectation and Variance to show
that E(T_N) = \mu E(N) and Var(T_N) = \sigma^2 E(N) + \mu^2 Var(N), where \mu = E(X_i)
and \sigma^2 = Var(X_i).


In this chapter we will now use probability generating functions to investigate
the whole distribution of T_N.


Theorem 4.6: Let X_1, X_2, ... be a sequence of independent and identically
distributed random variables with common PGF G_X. Let N be a random variable,
independent of the X_i’s, with PGF G_N, and let T_N = X_1 + ... + X_N = \sum_{i=1}^{N} X_i.
Then the PGF of T_N is:

    G_{T_N}(s) = G_N( G_X(s) ).


Proof:

    G_{T_N}(s) = E(s^{T_N}) = E(s^{X_1 + ... + X_N})
               = E_N{ E(s^{X_1 + ... + X_N} | N) }       (conditional expectation)
               = E_N{ E(s^{X_1} ... s^{X_N} | N) }
               = E_N{ E(s^{X_1} ... s^{X_N}) }            (X_i’s are independent of N)
               = E_N{ E(s^{X_1}) ... E(s^{X_N}) }         (X_i’s are independent of each other)
               = E_N{ (G_X(s))^N }
               = G_N( G_X(s) )                            (by definition of G_N). □


Example: Let X_1, X_2, ... and N be as above. Find the mean of T_N.

    E(T_N) = G_{T_N}'(1) = \frac{d}{ds} G_N( G_X(s) ) |_{s=1}
           = G_N'( G_X(s) ) \cdot G_X'(s) |_{s=1}
           = G_N'(1) \cdot G_X'(1)        Note: G_X(1) = 1 for any r.v. X
           = E(N) \cdot E(X_1),   the same answer as in Chapter 3.


Example: Heron goes fishing 


My aunt was asked by her neighbours to feed the prize 
goldfish in their garden pond while they were on holiday. 
Although my aunt dutifully went and fed them every day, 
she never saw a single fish for the whole three weeks. It 
turned out that all the fish had been eaten by a heron 
when she wasn’t looking! 


Let N be the number of times the heron visits the pond
during the neighbours’ absence. Suppose that N ~ Geometric(1 - \theta),
so P(N = n) = (1 - \theta) \theta^n, for n = 0, 1, 2, .... When the heron visits the pond
it has probability p of catching a prize goldfish, independently of what happens
on any other visit. (This assumes that there are infinitely many goldfish to be
caught!) Find the distribution of


T = total number of goldfish caught. 


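Solution:

    Let X_i = 1 if heron catches a fish on visit i,
              0 otherwise.

Then T = X_1 + X_2 + ... + X_N (randomly stopped sum), so

    G_T(s) = G_N( G_X(s) ).


Now

    G_X(s) = 1 - p + ps,        G_N(r) = \frac{1 - \theta}{1 - \theta r}.

So

    G_T(s) = \frac{1 - \theta}{1 - \theta G_X(s)}        (putting r = G_X(s)),

giving:

    G_T(s) = \frac{1 - \theta}{1 - \theta (1 - p + ps)}

           = \frac{1 - \theta}{1 - \theta + \theta p - \theta p s}

    [could this be Geometric? G_T(s) = \frac{1 - \pi}{1 - \pi s} for some \pi?]

           = \frac{1 - \theta}{(1 - \theta + \theta p) - \theta p s}

           = \frac{ \frac{1 - \theta}{1 - \theta + \theta p} }{ \frac{(1 - \theta + \theta p) - \theta p s}{1 - \theta + \theta p} }

           = \frac{ \frac{1 - \theta + \theta p - \theta p}{1 - \theta + \theta p} }{ 1 - \frac{\theta p}{1 - \theta + \theta p} s }

           = \frac{ 1 - \frac{\theta p}{1 - \theta + \theta p} }{ 1 - \frac{\theta p}{1 - \theta + \theta p} s }.


This is the PGF of the Geometric( 1 - \frac{\theta p}{1 - \theta + \theta p} ) distribution, so by
uniqueness of PGFs, we have:

    T ~ Geometric( \frac{1 - \theta}{1 - \theta + \theta p} ).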
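
The Geometric result can be checked by simulating the heron directly. The sketch below is an illustration only: the values of theta and p, the seed, and the number of draws are arbitrary choices, not taken from the notes. It generates T as a randomly stopped sum and compares the empirical probabilities with the Geometric probabilities \pi (1 - \pi)^t, where \pi = (1 - \theta)/(1 - \theta + \theta p).

```python
import random

def simulate_T(theta, p, rng):
    """One draw of T: the heron makes another visit with probability theta each
    time, and catches a fish on each visit with probability p."""
    caught = 0
    while rng.random() < theta:        # so that P(N = n) = (1 - theta) theta^n
        if rng.random() < p:
            caught += 1
    return caught

theta, p = 0.9, 0.3                    # illustrative values, not from the notes
rng = random.Random(2024)
draws = [simulate_T(theta, p, rng) for _ in range(200_000)]

pi = (1 - theta) / (1 - theta + theta * p)     # claimed Geometric parameter
for t in range(4):
    empirical = draws.count(t) / len(draws)
    print(t, round(empirical, 4), round(pi * (1 - pi) ** t, 4))

mean_T = sum(draws) / len(draws)
print(mean_T, (1 - pi) / pi)           # sample mean versus the Geometric mean
```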


Why did we need to use the PGF? 


We could have solved the heron problem without using the PGF, but it is much 
more difficult. PGFs are very useful for dealing with sums of random variables, 
which are difficult to tackle using the standard probability function. 


Here are the first few steps of solving the heron problem without the PGF. 
Recall the problem: 


• Let N ~ Geometric(1 - \theta), so P(N = n) = (1 - \theta) \theta^n;


• Let X_1, X_2, ... be independent of each other and of N, with X_i ~ Binomial(1, p)
  (remember X_i = 1 with probability p, and 0 otherwise);


• Let T = X_1 + ... + X_N be the randomly stopped sum;
• Find the distribution of T.


Without using the PGF, we would tackle this by looking for an expression for
P(T = t) for any t. Once we have obtained that expression, we might be able 
to see that T has a distribution we recognise (e.g. Geometric), or otherwise we 
would just state that T is defined by the probability function we have obtained. 


To find P(T = t), we have to partition over different values of N:

    P(T = t) = \sum_{n=0}^{\infty} P(T = t | N = n) P(N = n).      (*)


Here, we are lucky that we can write down the distribution of T | N = n:


• if N = n is fixed, then T = X_1 + ... + X_n is a sum of n independent
  Binomial(1, p) random variables, so (T | N = n) ~ Binomial(n, p).


For most distributions of X, it would be difficult or impossible to write down the
distribution of X_1 + ... + X_n: we would have to use an expression like

    P(X_1 + ... + X_N = t | N = n)
        = \sum_{x_1=0}^{t} \sum_{x_2=0}^{t - x_1} ... \sum_{x_{n-1}=0}^{t - (x_1 + ... + x_{n-2})}
          { P(X_1 = x_1) \times P(X_2 = x_2) \times ... \times P(X_{n-1} = x_{n-1})
            \times P(X_n = t - (x_1 + ... + x_{n-1})) }.


Back to the heron problem: we are lucky in this case that we know the
distribution of (T | N = n) is Binomial(N = n, p), so

    P(T = t | N = n) = \binom{n}{t} p^t (1 - p)^{n-t}   for t = 0, 1, ..., n.


Continuing from (*):

    P(T = t) = \sum_{n=0}^{\infty} P(T = t | N = n) P(N = n)
             = \sum_{n=t}^{\infty} \binom{n}{t} p^t (1 - p)^{n-t} (1 - \theta) \theta^n
             = (1 - \theta) ( \frac{p}{1 - p} )^t \sum_{n=t}^{\infty} \binom{n}{t} [ \theta (1 - p) ]^n      (**)
             = ...

As it happens, we can evaluate the sum in (**) using the fact that Negative
Binomial probabilities sum to 1. You can try this if you like, but it is quite
tricky. [Hint: use the Negative Binomial(t + 1, 1 - \theta (1 - p)) distribution.]


Overall, we obtain the same answer that T ~ Geometric( \frac{1 - \theta}{1 - \theta + \theta p} ), but
hopefully you can see why the PGF is so useful.


Without the PGF, we have two major difficulties:


1. Writing down P(T = t | N = n);
2. Evaluating the sum over n in (**).


For a general problem, both of these steps might be too difficult to do without 
a computer. The PGF has none of these difficulties, and even if Gr(s) does not 
simplify readily, it still tells us everything there is to know about the distribution 
of T. 


4.7 Summary: Properties of the PGF 


Definition:     G_X(s) = E(s^X)
Used for:       Discrete r.v.s with values 0, 1, 2, ...
Moments:        E(X) = G_X'(1)        E{ X(X - 1)...(X - k + 1) } = G_X^{(k)}(1)
Probabilities:  P(X = n) = \frac{1}{n!} G_X^{(n)}(0)
Sums:           G_{X+Y}(s) = G_X(s) G_Y(s) for independent X, Y


4.8 Convergence of PGFs


We have been using PGFs throughout this chapter without paying much
attention to their mathematical properties. For example, are we sure that the
power series G_X(s) = \sum_{x=0}^{\infty} s^x P(X = x) converges? Can we differentiate and
integrate the infinite power series term by term as we did in Section 4.4? When
we said in Section 4.4 that E(X) = G_X'(1), can we be sure that G_X(1) and its
derivative G_X'(1) even exist?


This technical section introduces the radius of convergence of the PGF.
Although it isn’t obvious, it is always safe to assume convergence of G_X(s) at
least for |s| < 1. Also, there are results that assure us that E(X) = G_X'(1) will
work for all non-defective random variables X.


Definition: The radius of convergence of a probability generating function is a
number R \geq 0, such that the sum G_X(s) = \sum_{x=0}^{\infty} s^x P(X = x) converges if
|s| < R and diverges (\to \infty) if |s| > R.


(No general statement is made about what happens when |s| = R.)


Fact: For any PGF, the radius of convergence exists.
It is always \geq 1: every PGF converges for at least s \in (-1, 1).


The radius of convergence could be anything from R = 1 to R = \infty.


Note: This gives us the surprising result that the set of s for which the PGF G_X(s)
converges is symmetric about 0: the PGF converges for all s \in (-R, R), and
for no s < -R or s > R.


This is surprising because the PGF itself is not usually symmetric about 0: i.e.
G_X(-s) \neq G_X(s) in general.


Example 1: Geometric distribution 


Let X ~ Geometric(p = 0.8). What is the radius of convergence of G_X(s)?

As in Section 4.2,

    G_X(s) = \sum_{x=0}^{\infty} s^x (0.8)(0.2)^x = 0.8 \sum_{x=0}^{\infty} (0.2 s)^x
           = \frac{0.8}{1 - 0.2 s}   for all s such that |0.2 s| < 1.


This is valid for all s with |0.2 s| < 1, so it is valid for all s with |s| < \frac{1}{0.2} = 5
(i.e. -5 < s < 5).
The radius of convergence is R = 5.
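
The radius of convergence can be seen numerically by watching partial sums of the series. The sketch below is an illustration only: the function names, the test points, and the numbers of terms are arbitrary choices. It compares partial sums of \sum_x s^x p q^x with the closed form p/(1 - qs) at points inside and outside (-5, 5).

```python
def partial_sum(s, n_terms, p=0.8):
    """First n_terms terms of G_X(s) = sum_x s^x p q^x for X ~ Geometric(p)."""
    q = 1 - p
    return sum(p * (q * s) ** x for x in range(n_terms))

def closed_form(s, p=0.8):
    """The function p / (1 - qs), which equals G_X(s) only when |qs| < 1."""
    return p / (1 - (1 - p) * s)

for s in (4.0, -4.0, 6.0, -6.0):
    sums = [partial_sum(s, n) for n in (10, 100, 1000)]
    print(s, sums, closed_form(s))
```

For s = 4 and s = -4 the partial sums settle on the closed form. For s = 6 and s = -6 they grow without bound, even though p/(1 - qs) itself is still finite there.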


The figure shows the PGF of the Geometric(p = 0.8) distribution, with its 
radius of convergence R = 5. Note that although the convergence set (—5, 5) is 
symmetric about 0, the function Gx(s) = p/(1 — qs) = 4/(5 — s) is not. 


[Figure: the Geometric(0.8) probability generating function G(s) plotted against s,
with the radius of convergence marked from -5 to 5. G(s) goes to infinity as s
approaches 5. Annotation for the region s < -5: in this region, p/(1 - qs) remains
finite and well-behaved, but it is no longer equal to E(s^X).]


At the limits of convergence, strange things happen: 


• At the positive end, as s \uparrow 5, both G_X(s) and p/(1 - qs) approach infinity.
  So the PGF is (left)-continuous at +R:

      \lim_{s \uparrow 5} G_X(s) = G_X(5) = \infty.

  However, the PGF does not converge at s = +R.


• At the negative end, as s \downarrow -5, the function p/(1 - qs) = 4/(5 - s) is
  continuous and passes through 0.4 when s = -5. However, when s \leq -5,
  this function no longer represents G_X(s) = 0.8 \sum_{x=0}^{\infty} (0.2 s)^x, because
  |0.2 s| \geq 1.

  Additionally, when s = -5, G_X(-5) = 0.8 \sum_{x=0}^{\infty} (-1)^x does not exist.
  Unlike the positive end, this means that G_X(s) is not (right)-continuous
  at -R:

      \lim_{s \downarrow -5} G_X(s) = 0.4 \neq G_X(-5).

  Like the positive end, this PGF does not converge at s = -R.


Example 2: Binomial distribution


Let X ~ Binomial(n, p). What is the radius of convergence of G_X(s)?


As in Section 4.2,

    G_X(s) = \sum_{x=0}^{n} s^x \binom{n}{x} p^x q^{n-x}
           = \sum_{x=0}^{n} \binom{n}{x} (ps)^x q^{n-x}
           = (ps + q)^n   by the Binomial Theorem: true for all s.


This is true for all -\infty < s < \infty, so the radius of convergence is R = \infty.


Abel’s Theorem for continuity of power series at s = 1


Recall from above that if X ~ Geometric(0.8), then G_X(s) is not continuous
at the negative end of its convergence (-R):

    \lim_{s \downarrow -5} G_X(s) \neq G_X(-5).


Abel’s theorem states that this sort of effect can never happen at s = 1 (or at
+R). In particular, G_X(s) is always left-continuous at s = 1:

    \lim_{s \uparrow 1} G_X(s) = G_X(1) always, even if G_X(1) = \infty.


Theorem 4.8: Abel’s Theorem.
Let G(s) = \sum_{i=0}^{\infty} p_i s^i for any p_0, p_1, p_2, ... with p_i \geq 0 for all i.

Then G(s) is left-continuous at s = 1:

    \lim_{s \uparrow 1} G(s) = \sum_{i=0}^{\infty} p_i = G(1),

whether or not this sum is finite.


Note: Remember that the radius of convergence R \geq 1 for any PGF, so Abel’s
Theorem means that even in the worst-case scenario when R = 1, we can still
trust that the PGF will be continuous at s = 1. (By contrast, we can not be
sure that the PGF will be continuous at the lower limit -R.)


Abel’s Theorem means that for any PGF, we can write G_X(1) as shorthand for
\lim_{s \uparrow 1} G_X(s).


It also clarifies our proof that E(X) = G_X'(1) from Section 4.4. If we assume
that term-by-term differentiation is allowed for G_X(s) (see below), then that
proof gives:

    G_X(s) = \sum_{x=0}^{\infty} s^x p_x,

    so G_X'(s) = \sum_{x=1}^{\infty} x s^{x-1} p_x   (term-by-term differentiation: see below).

Abel’s Theorem establishes that E(X) is equal to \lim_{s \uparrow 1} G_X'(s):

    E(X) = \sum_{x=1}^{\infty} x p_x
         = G_X'(1)
         = \lim_{s \uparrow 1} G_X'(s),

because Abel’s Theorem applies to G_X'(s) = \sum_{x=1}^{\infty} x s^{x-1} p_x, establishing that
G_X'(s) is left-continuous at s = 1. Without Abel’s Theorem, we could not be
sure that the limit of G_X'(s) as s \uparrow 1 would give us the correct answer for E(X).


Absolute and uniform convergence for term-by-term differentiation 


We have stated that the PGF converges for all |s| < R for some R. In fact,
the probability generating function converges absolutely if |s| < R. Absolute
convergence is stronger than convergence alone: it means that the sum of
absolute values, \sum_{x=0}^{\infty} |s^x P(X = x)|, also converges. When two series both converge
absolutely, the product series also converges absolutely. This guarantees that
G_X(s) \times G_Y(s) is absolutely convergent for any two random variables X and Y.
This is useful because G_X(s) \times G_Y(s) = G_{X+Y}(s) if X and Y are independent.


The PGF also converges uniformly on any set {s : |s| < R'} where R' < R.
Intuitively, this means that the speed of convergence does not depend upon the
value of s. Thus a value n_0 can be found such that for all values of n > n_0,
the finite sum \sum_{x=0}^{n} s^x P(X = x) is simultaneously close to the converged value
G_X(s), for all s with |s| < R'. In mathematical notation: \forall \epsilon > 0, \exists n_0 \in
Z such that \forall s with |s| < R', and \forall n > n_0,

    | \sum_{x=0}^{n} s^x P(X = x) - G_X(s) | < \epsilon.


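Uniform convergence allows us to differentiate or integrate the PGF term by
term.


Fact: Let G_X(s) = E(s^X) = \sum_{x=0}^{\infty} s^x P(X = x), and let |s| < R.

1.  G_X'(s) = \frac{d}{ds} ( \sum_{x=0}^{\infty} s^x P(X = x) )
            = \sum_{x=0}^{\infty} \frac{d}{ds} ( s^x P(X = x) )
            = \sum_{x=0}^{\infty} x s^{x-1} P(X = x)

    (term by term differentiation).

2.  \int_a^b G_X(s) ds = \int_a^b ( \sum_{x=0}^{\infty} s^x P(X = x) ) ds
                       = \sum_{x=0}^{\infty} ( \int_a^b s^x P(X = x) ds )
                       = \sum_{x=0}^{\infty} \frac{b^{x+1} - a^{x+1}}{x + 1} P(X = x)   for -R < a < b < R

    (term by term integration).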




4.9 Special Process: the Random Walk 


We briefly saw the Drunkard’s Walk in Chapter 1: a drunk person staggers 
to left and right as he walks. This process is called the Random Walk in 
stochastic processes. Probability generating functions are particularly useful 
for processes such as the random walk, because the process is defined as the 
sum of a single repeating step. The repeating step is a move of one unit, left 
or right at random. The sum of the first t steps gives the position at time t.
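
The description of the walk as a running sum of steps translates directly into code. The sketch below is an illustration only: the function name, the starting point 0, and the seed are arbitrary choices.

```python
import random
from itertools import accumulate

def random_walk(t, p=0.5, seed=None):
    """Positions X_1, ..., X_t of a walk started at 0: each step is +1 with
    probability p and -1 otherwise, and the position is the running sum of steps."""
    rng = random.Random(seed)
    steps = [1 if rng.random() < p else -1 for _ in range(t)]
    return list(accumulate(steps))

path = random_walk(20, seed=1)
print(path)
```

With p = 0.5 this is the symmetric random walk. Other values of p give a walk that drifts to the right or to the left.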


The transition diagram below shows the symmetric random walk (all transitions
have probability p = 1/2).

[Figure: transition diagram of the symmetric random walk; every move to the left
or to the right has probability 1/2.]


Question: 


What is the key difference between the random walk and the gambler’s ruin? 


The random walk has an INFINITE state space: it never stops. The gambler’s 
ruin stops at both ends. 


This fact has two important consequences: 


• The random walk is hard to tackle using first-step analysis, because we
  would have to solve an infinite number of simultaneous equations. In this
  respect it might seem to be more difficult than the gambler’s ruin.


• Because the random walk never stops, all states are equal.

  In the gambler’s ruin, states are not equal: the states closest to 0 are
  more likely to end in ruin than the states closest to winning. By contrast,
  the random walk has no end-points, so (for example) the distribution of
  the time to reach state 5 starting from state 0 is exactly the same as the
  distribution of the time to reach state 1005 starting from state 1000. We
  can exploit this fact to solve some problems for the random walk that
  would be much more difficult to solve for the gambler’s ruin.


PGFs for finding the distribution of reaching times

For random walks, we are particularly interested in reaching times:

• How long will it take us to reach state j, starting from state i?

• Is there a chance that we will never reach state j, starting from state i?


In Chapter 3 we saw how to find expected reaching times: the expected 
number of steps taken to reach a particular state. We used the law of total 
expectation and first-step analysis (Section 3.5). 


However, the expected or average reaching time doesn’t tell the whole story. 
Think back to the model for gene spread in Section 3.7. If there is just one 
animal out of 100 with the harmful allele, the expected number of generations to 
fixation is quite large at 10.5: even though the allele will usually die out after one 
or two generations. The high average is caused by a small chance that the allele 
will take hold and grow, requiring a very large number of generations before it 
either dies out or saturates the population. In most stochastic processes, the 
average is of limited use by itself, without having some idea about the variance 
and skew of the distribution. 


With our tool of PGFs, we can characterise the whole distribution of the time 
T taken to reach a particular state, by finding its PGF. This will give us the 
mean, variance, and skew by differentiation. In principle the PGF could even 
give us the full set of probabilities, P(T = t) for all possible t = 0,1,2,...,
though in practice it may be computationally infeasible to find more than the 
first few probabilities by repeated differentiation. 


However, there is a new and very useful piece of information that the PGF can 
tell us quickly and easily: 


what is the probability that we NEVER reach state j, starting from state i?


For example, imagine that the random walk represents the share value for an
investment. The current share price is i dollars, and we might decide to sell
when it reaches j dollars. Knowing how long this might take, and whether there
is a chance we will never succeed, is fundamental to managing our investment.


To tackle this problem, we define the random variable T to be the time taken
(number of steps) to reach state j, starting from state i. We find the PGF of
T, and then use the PGF to discover P(T = ∞). If P(T = ∞) > 0, there is a
positive chance that we will NEVER reach state j, starting from state i.


We will see how to determine the probability of never reaching our goal in
Section 4.11. First we will see how to calculate the PGF of a reaching time T
in the random walk.


Finding the PGF of a reaching time in the random walk 


[Transition diagram: the symmetric random walk on the integers; every arrow between neighbouring states is labelled 1/2.]


Define $T_{ij}$ to be the number of steps taken to reach state j, starting at state i.
$T_{ij}$ is called the first reaching time from state i to state j.


We will focus on $T_{01}$ = number of steps to get from state 0 to state 1.


Problem: Let $H(s) = E(s^{T_{01}})$ be the PGF of $T_{01}$. Find H(s).


Solution:
Let $Y_n$ be the step taken at time n: up or down. For the symmetric random walk,

$$Y_n = \begin{cases} 1 & \text{with probability } 0.5, \\ -1 & \text{with probability } 0.5, \end{cases}$$

and $Y_1, Y_2, \ldots$ are independent.


Recall $T_{ij}$ = number of steps to get from state i to state j for any i, j,
and $H(s) = E(s^{T_{01}})$ is the PGF required.

Use first-step analysis, partitioning over the first step $Y_1$:

$$H(s) = E(s^{T_{01}})$$
$$= E(s^{T_{01}} \mid Y_1 = 1)\,P(Y_1 = 1) + E(s^{T_{01}} \mid Y_1 = -1)\,P(Y_1 = -1)$$
$$= \frac{1}{2}\left\{ E(s^{T_{01}} \mid Y_1 = 1) + E(s^{T_{01}} \mid Y_1 = -1) \right\}. \qquad (\spadesuit)$$


Now if $Y_1 = 1$, then $T_{01} = 1$ definitely, so $E(s^{T_{01}} \mid Y_1 = 1) = s^1 = s$.


If $Y_1 = -1$, then $T_{01} = 1 + T_{-1,1}$:

- one step from state 0 to state −1,
- then $T_{-1,1}$ steps from state −1 to state 1.

But $T_{-1,1} = T_{-1,0} + T_{01}$, because the process must pass through 0 to get from −1
to 1.


Now $T_{-1,0}$ and $T_{01}$ are independent (Markov property). Also, they have the
same distribution because the process is translation invariant (i.e. all states are
the same).

Thus

$$E(s^{T_{01}} \mid Y_1 = -1) = E(s^{1 + T_{-1,1}})$$
$$= E(s^{1 + T_{-1,0} + T_{01}})$$
$$= s\,E(s^{T_{-1,0}})\,E(s^{T_{01}}) \quad \text{by independence}$$
$$= s\,(H(s))^2 \quad \text{because identically distributed.}$$

Thus

$$H(s) = \frac{1}{2}\left\{ s + s\,(H(s))^2 \right\} \quad \text{by } (\spadesuit).$$


This is a quadratic in H(s):

$$\frac{1}{2} s\,(H(s))^2 - H(s) + \frac{1}{2} s = 0$$

$$\Rightarrow H(s) = \frac{1 \pm \sqrt{1 - 4 \cdot \frac{1}{2}s \cdot \frac{1}{2}s}}{s} = \frac{1 \pm \sqrt{1 - s^2}}{s}.$$


Which root? We know that $P(T_{01} = 0) = 0$, because it must take at least one step
to go from 0 to 1, so we need $H(0) = 0$. With the positive root, $\lim_{s \to 0} H(s) = \lim_{s \to 0} \left( \frac{2}{s} \right) = \infty$; so
we take the negative root instead.

Thus

$$H(s) = \frac{1 - \sqrt{1 - s^2}}{s}.$$


Check this has $\lim_{s \to 0} H(s) = 0$ by L’Hospital’s Rule, with $f(s) = 1 - \sqrt{1 - s^2}$ and $g(s) = s$:

$$\lim_{s \to 0} \left( \frac{f(s)}{g(s)} \right) = \lim_{s \to 0} \left( \frac{f'(s)}{g'(s)} \right) = \lim_{s \to 0} \left\{ \frac{\frac{1}{2}(1 - s^2)^{-1/2} \times 2s}{1} \right\} = 0.$$
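One way to check the formula for H(s) is to compute the probabilities P(T01 = t) directly and sum the power series. The sketch below does this by tracking the walk's position over all paths that have not yet reached state 1; the function names and the evaluation point s = 0.5 are illustrative assumptions.

```python
import math

def first_passage_probs(t_max, p_up=0.5):
    """Return [P(T=0), ..., P(T=t_max)], T = first time the walk from 0 reaches 1."""
    probs = [0.0] * (t_max + 1)
    alive = {0: 1.0}   # position -> probability of being there without having reached 1
    for t in range(1, t_max + 1):
        nxt = {}
        for pos, pr in alive.items():
            if pos + 1 == 1:
                probs[t] += pr * p_up          # first arrival at state 1
            else:
                nxt[pos + 1] = nxt.get(pos + 1, 0.0) + pr * p_up
            nxt[pos - 1] = nxt.get(pos - 1, 0.0) + pr * (1 - p_up)
        alive = nxt
    return probs

def H(s):
    """Negative root of the quadratic: H(s) = (1 - sqrt(1 - s^2)) / s."""
    return (1 - math.sqrt(1 - s * s)) / s

probs = first_passage_probs(200)
s = 0.5
series = sum(p * s**t for t, p in enumerate(probs))   # truncated sum of s^t P(T = t)
print(probs[:6])
print(series, H(s))
```

The first few probabilities are 1/2 at t = 1, 1/8 at t = 3 and 1/16 at t = 5, with zeros at even t, and the truncated series agrees with the negative root. The positive root could not be a PGF here, since it is unbounded near s = 0.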
Notation for quick solutions of first-step analysis for finding PGFs


As with first-step analysis for finding hitting probabilities and expected reaching
times, setting up a good notation is extremely important. Here is a good
notation for finding $H(s) = E(s^{T_{01}})$.


Let $T = T_{01}$. Seek $H(s) = E(s^T)$.
Now

$$T = \begin{cases} 1 & \text{with probability } 1/2, \\ 1 + T' + T'' & \text{with probability } 1/2, \end{cases}$$

where $T' \sim T'' \sim T$ and $T'$, $T''$ are independent.


Taking expectations:

$$H(s) = E(s^T) = \begin{cases} E(s^1) & \text{w.p. } 1/2 \\ E(s^{1 + T' + T''}) & \text{w.p. } 1/2 \end{cases}$$

$$\Rightarrow H(s) = \begin{cases} s & \text{w.p. } 1/2 \\ s\,E(s^{T'})\,E(s^{T''}) & \text{w.p. } 1/2 \quad \text{(by independence of } T' \text{ and } T'') \end{cases}$$

$$\Rightarrow H(s) = \begin{cases} s & \text{w.p. } 1/2 \\ s\,H(s)\,H(s) & \text{w.p. } 1/2 \quad \text{(because } T' \sim T'' \sim T) \end{cases}$$

$$\Rightarrow H(s) = \frac{1}{2} s + \frac{1}{2} s\,H(s)^2.$$

Thus:

$$s\,H(s)^2 - 2H(s) + s = 0.$$


Solve the quadratic and select the correct root as before, to get

$$H(s) = \frac{1 - \sqrt{1 - s^2}}{s} \quad \text{for } |s| < 1.$$


4.10 Defective random variables


A random variable is said to be defective if it can take the value ∞.


In stochastic processes, a reaching time $T_{ij}$ is defective if there is a chance that
we NEVER reach state j, starting from state i.


The probability that we never reach state j, starting from state i, is the same
as the probability that the time taken is infinite, $T_{ij} = \infty$:

$$P(T_{ij} = \infty) = P(\text{we NEVER reach state } j, \text{ starting from state } i).$$


In other cases, we will always reach state j eventually, starting from state i.
In that case, $T_{ij}$ can not take the value ∞:

$$P(T_{ij} = \infty) = 0 \quad \text{if we are CERTAIN to reach state } j, \text{ starting from state } i.$$


Definition: A random variable T is defective, or improper, if it can take the value
∞. That is,

T is defective if P(T = ∞) > 0.


Thinking of $\sum_{t=0}^{\infty} P(T = t)$ as $1 - P(T = \infty)$


Although it seems strange, when we write $\sum_{t=0}^{\infty} P(T = t)$, we are not including
the value t = ∞.


The sum $\sum_{t=0}^{\infty}$ continues without ever stopping: at no point can we say we have
‘finished’ all the finite values of t so we will now add on t = ∞. We simply
never get to t = ∞ when we take $\sum_{t=0}^{\infty}$.


For a defective random variable T, this means that

$$\sum_{t=0}^{\infty} P(T = t) < 1,$$

because we are missing the positive value of P(T = ∞).


All probabilities of T must still sum to 1, so we have

$$1 = \sum_{t=0}^{\infty} P(T = t) + P(T = \infty),$$

in other words

$$\sum_{t=0}^{\infty} P(T = t) = 1 - P(T = \infty).$$


PGFs for defective random variables


When T is defective, the PGF of T is defined as the power series

$$H(s) = \sum_{t=0}^{\infty} P(T = t)\,s^t \quad \text{for } |s| < 1.$$


The term for $P(T = \infty)\,s^{\infty}$ is missed out. The PGF is defined as the generating
function of the probabilities for finite values only.




Because H(s) is a power series satisfying the conditions of Abel’s Theorem, we
know that:


• H(s) is left-continuous at s = 1, i.e. $\lim_{s \uparrow 1} H(s) = H(1)$.

This is different from the behaviour of $E(s^T)$, if T is defective:


• $E(s^T) = H(s)$ for |s| < 1 because the missing term is zero: i.e. because
$s^{\infty} = 0$ when |s| < 1.

• $E(s^T)$ is NOT left-continuous at s = 1. There is a sudden leap (discontinuity)
at s = 1 because $s^{\infty} = 0$ as $s \uparrow 1$, but $s^{\infty} = 1$ when s = 1.


Thus H(s) does NOT represent $E(s^T)$ at s = 1. It is as if H(s) is a ‘train’ that
$E(s^T)$ rides on between −1 < s < 1. At s = 1, the train keeps going (i.e. H(s)
is continuous) but $E(s^T)$ jumps off the train.


We test whether T is defective by testing whether or not $E(s^T)$ ‘jumps off the
train’: that is, we test whether or not H(s) is equal to $E(s^T)$ when s = 1.


We know what $E(s^T)$ is when s = 1:


• $E(s^T)$ is always 1 when s = 1, whether T is defective or not:

$E(1^T) = 1$ for ANY random variable T.

But the function $H(s) = \sum_{t=0}^{\infty} s^t P(T = t)$ may or may not be 1 when s = 1:


• If T is defective, H(s) is missing a term and H(1) < 1.

• If T is not defective, H(s) is not missing anything so H(1) = 1.


Test for defectiveness:

Let $H(s) = \sum_{t=0}^{\infty} s^t P(T = t)$ be the power series representing the PGF of T
for |s| < 1. Then T is defective if and only if H(1) < 1.


Using defectiveness to find the probability we never get there


The simple test for defectiveness tells us whether there is a positive probability
that we NEVER reach our goal. Here are the steps.


1. We want to know the probability that we will NEVER reach state j, starting
from state i.


2. Define T to be the random variable giving the number of steps taken to
get from state i to state j.


3. The event that we never reach state j, starting from state i, is the same
as the event that T = ∞. (If we wait an infinite length of time, we never
get there.) So

P(never reach state j | start at state i) = P(T = ∞).


4. Find $H(s) = \sum_{t=0}^{\infty} s^t P(T = t)$, using a calculation like the one we did in
Section 4.9. H(s) is the PGF of T for |s| < 1. We only need to find it for
|s| < 1. The calculation in Section 4.9 only works for |s| ≤ 1 because the
expectations are infinite or undefined when |s| > 1.


5. The random variable T is defective if and only if H(1) < 1.

6. If H(1) < 1, then the probability that T takes the value ∞ is the missing
piece: P(T = ∞) = 1 − H(1).


Overall:

P(never reach state j | start at state i) = P(T = ∞) = 1 − H(1).
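To see the test 1 − H(1) produce a non-zero answer, the sketch below uses a biased walk. It assumes a generalisation that is not derived in these notes: with up-probability p and down-probability q = 1 − p, the same first-step argument gives H(s) = ps + qsH(s)², whose smaller root is H(s) = (1 − √(1 − 4pqs²))/(2qs). The function names are hypothetical.

```python
import math

def H_biased(s, p):
    """Smaller root of q*s*H^2 - H + p*s = 0, with q = 1 - p (assumed generalisation)."""
    q = 1 - p
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

def prob_never_reach(p):
    """P(T = infinity) = 1 - H(1) for the walk with up-probability p."""
    return 1 - H_biased(1.0, p)

for p in (0.5, 0.3, 0.7):
    print(p, prob_never_reach(p))
```

For p = 0.5 this reduces to the symmetric formula and gives probability 0 of never arriving. For p = 0.3 it gives H(1) = 3/7, so the walk fails to reach state 1 with probability 4/7, and the reaching time is defective.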


Expectation and variance of a defective random variable


If T is defective, there is a positive chance that T = ∞. This means that
E(T) = ∞, Var(T) = ∞, and $E(T^a) = \infty$ for any power a.

E(T) and Var(T) can not be found using the PGF when T is defective: you
will get the wrong answer.


When you are asked to find E(T) in a context where T might be defective:


• First check whether T is defective: is H(1) < 1 or = 1?
• If T is defective, then E(T) = ∞.
• If T is not defective (H(1) = 1), then E(T) = H′(1) as usual.


4.11 Random Walk: the probability we never reach our goal 


In the random walk in Section 4.9, we defined the first reaching time $T_{01}$ as the
number of steps taken to get from state 0 to state 1.


In Section 4.9 we found the PGF of $T_{01}$ to be:

$$\text{PGF of } T_{01} = H(s) = \frac{1 - \sqrt{1 - s^2}}{s} \quad \text{for } |s| < 1.$$


Questions:
a) What is the probability that we never reach state 1, starting from state 0?

b) What is the expected number of steps to reach state 1, starting from state 0?


Solutions:
a) We need to know whether $T_{01}$ is defective.

$T_{01}$ is defective if and only if H(1) < 1.
Now $H(1) = \frac{1 - \sqrt{1 - 1^2}}{1} = 1$. So $T_{01}$ is not defective.

Thus
P(never reach state 1 | start from state 0) = 0.

We will DEFINITELY reach state 1 eventually, even if it takes a very long time.


b) Because $T_{01}$ is not defective, we can find $E(T_{01})$ by differentiating the PGF:
$E(T_{01}) = H'(1)$.

$$H(s) = \frac{1 - \sqrt{1 - s^2}}{s} = s^{-1} - \left( s^{-2} - 1 \right)^{1/2}$$

So

$$H'(s) = -s^{-2} - \frac{1}{2}\left( s^{-2} - 1 \right)^{-1/2} \left( -2 s^{-3} \right).$$

Thus

$$E(T_{01}) = \lim_{s \uparrow 1} H'(s) = \lim_{s \uparrow 1} \left( -\frac{1}{s^2} + \frac{1}{s^3 \sqrt{\frac{1}{s^2} - 1}} \right) = \infty.$$


So the expected number of steps to reach state 1 starting from state 0 is infinite:
$E(T_{01}) = \infty$.
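Both conclusions can be seen numerically by reading the probabilities P(T01 = t) off the relation H(s) = s/2 + (s/2)H(s)² from Section 4.9: matching coefficients of s^t gives a simple recursion. The sketch below is an illustration under that assumption; the function name and the cut-off values are arbitrary.

```python
def pgf_coefficients(t_max):
    """Coefficients h_t = P(T01 = t), read off from H(s) = s/2 + (s/2) H(s)^2."""
    h = [0.0] * (t_max + 1)
    if t_max >= 1:
        h[1] = 0.5
    for t in range(2, t_max + 1):
        # Half the coefficient of s^(t-1) in H(s)^2.
        h[t] = 0.5 * sum(h[i] * h[t - 1 - i] for i in range(1, t - 1))
    return h

h = pgf_coefficients(1600)
for n in (100, 400, 1600):
    total = sum(h[: n + 1])                                   # P(T01 <= n)
    partial_mean = sum(t * p for t, p in enumerate(h[: n + 1]))
    print(n, round(total, 4), round(partial_mean, 2))
```

The cumulative probability creeps towards 1, consistent with T01 not being defective, while the partial sums of t·P(T01 = t) keep growing (roughly doubling each time n is quadrupled), consistent with an infinite mean.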


This result is striking. Even though we will definitely reach state 1, the 
expected time to do so is infinite! In general, we can prove the following results 
for random walks, starting from state 0: 


Property        Reach state 1?    P(T01 = ∞)    E(T01)
p > q           Guaranteed        0             finite
p = q = 1/2     Guaranteed        0             ∞
p < q           Not guaranteed    > 0           ∞

(Here p is the probability of a step up and q = 1 − p is the probability of a step down.)


Note: (Non-examinable) If T is defective in the random walk, $E(s^T)$ is not
continuous at s = 1. In Section 4.9 we had to solve a quadratic equation to find
$H(s) = E(s^T)$. The negative root solution for H(s) generally represents $E(s^T)$
for s < 1. At s = 1, the solution for $E(s^T)$ suddenly flips from the − root to
the + root of the quadratic. This explains how $E(s^T)$ can be discontinuous as
$s \uparrow 1$, even though the negative root for H(s) is continuous as $s \uparrow 1$ and all the
working of Section 4.9 still applies for s = 1. The reason is that we suddenly
switch from the − root to the + root at s = 1.


When |s| > 1, the conditional expectations are not finite so the working of 
Section 4.9 no longer applies. 


The problem with being abstract... 
• Each card has a letter on one side and a number on the other.


• We wish to test the following rule:


If the card has a D on one side,
then it has a 3 on the other side.


• Which card or cards should you turn over, and ONLY these
cards, in order to test the rule?


At a party...


If you are drinking alcohol,
you must be 18 or over.


• Each card has the person’s age on one side, and their drink
on the other side.


• Which card or cards should you turn over, and ONLY these
cards, in order to test the rule?


Chapter 5: Mathematical Induction


So far in this course, we have seen some techniques for dealing with stochastic
processes: first-step analysis for hitting probabilities (Chapter 2), and first-step
analysis for expected reaching times (Chapter 3). We now look at another tool
that is often useful for exploring properties of stochastic processes: proof by
mathematical induction.


5.1 Proving things in mathematics


There are many different ways of constructing a formal proof in mathematics. 
Some examples are: 


Proof by counterexample: a proposition is proved to be not generally true 
because a particular example is found for which it is not true. 


Proof by contradiction: this can be used either to prove a proposition is 
true or to prove that it is false. To prove that the proposition is true (say), 
we start by assuming that it is false. We then explore the consequences of 
this assumption until we reach a contradiction, e.g. 0 = 1. Therefore something 
must have gone wrong, and the only thing we weren’t sure about was our initial 
assumption that the proposition is false — so our initial assumption must be 
wrong and the proposition is proved true. 


A famous proof of this sort is the proof that there are infinitely many prime 
numbers. We start by assuming that there are finitely many primes, so they 
can be listed as $p_1, p_2, \ldots, p_n$, where $p_n$ is the largest prime number. But then
the number $p_1 \times p_2 \times \ldots \times p_n + 1$ must also be prime, because it is not divisible
by any of the smaller primes. Furthermore this number is definitely bigger than
$p_n$. So we have contradicted the idea that there was a ‘biggest’ prime called $p_n$,
and therefore there are infinitely many primes. 


Proof by mathematical induction: in mathematical induction, we start 
with a formula that we suspect is true. For example, I might suspect from
observation that $\sum_{k=1}^{n} k = n(n+1)/2$. I might have tested this formula for
many different values of n, but of course I can never test it for all values of n. 
Therefore I need to prove that the formula is always true. 
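The kind of testing described here is easy to automate, which makes its limitation concrete. The short sketch below (function name hypothetical) checks the formula for n = 1, ..., 1000.

```python
def formula_holds(n):
    """Check that 1 + 2 + ... + n equals n(n+1)/2 for one value of n."""
    return sum(range(1, n + 1)) == n * (n + 1) // 2

print(all(formula_holds(n) for n in range(1, 1001)))
```

However large the range, a check like this only covers finitely many values of n, which is exactly why a proof is still needed.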


The idea of mathematical induction is to say: suppose the formula is true for 
all n up to the value n = 10 (say). Can I prove that, if it is true for n = 10, 
then it will also be true for n = 11? And if it is true for n = 11, then it will
also be true for n = 12? And so on. 


In practice, we usually start lower than n = 10. We usually take the very easiest 
case, n = 1, and prove that the formula is true for n = 1: LHS = $\sum_{k=1}^{1} k$ =
1 = 1 × 2/2 = RHS. Then we prove that, if the formula is ever true for n = x,
then it will always be true for n = x + 1. Because it is true for n = 1, it must
be true for n = 2; and because it is true for n = 2, it must be true for n = 3; 
and so on, for all possible n. Thus the formula is proved. 


Mathematical induction is therefore a bit like a first-step analysis for proving 
things: prove that wherever we are now, the next step will always be OK. Then 
if we were OK at the very beginning, we will be OK for ever. 


The method of mathematical induction for proving results is very important in 
the study of Stochastic Processes. This is because a stochastic process builds 
up one step at a time, and mathematical induction works on the same principle. 


Example: We have already seen examples of inductive-type reasoning in this
course. For example, in Chapter 2 for the Gambler’s Ruin problem, using
the method of repeated substitution to solve for $p_x = P(\text{Ruin} \mid \text{start with } \$x)$,
we discovered that:


• $p_2 = 2p_1 - 1$
• $p_3 = 3p_1 - 2$
• $p_4 = 4p_1 - 3$


We deduced that $p_x = x p_1 - (x - 1)$ in general.


To prove this properly, we should have used the method of mathematical 
induction. 




5.2 Mathematical Induction by example 


This example explains the style and steps needed for a proof by induction.


Question: Prove by induction that $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$ for any integer n ≥ 1. (⋆)


Approach: follow the steps below.


(i) First verify that the formula is true for a base case: usually the smallest appropriate
value of n (e.g. n = 0 or n = 1). Here, the smallest possible value of n is
n = 1, because we can’t have $\sum_{k=1}^{0}$.


First verify (⋆) is true when n = 1.

$$\text{LHS} = \sum_{k=1}^{1} k = 1.$$
$$\text{RHS} = \frac{1 \times 2}{2} = 1 = \text{LHS}.$$

So (⋆) is proved for n = 1.


(ii) Next suppose that formula (⋆) is true for all values of n up to and including
some value x. (We have already established that this is the case for x = 1.)


Using the hypothesis that (⋆) is true for all values of n up to and including x,
prove that it is therefore true for the value n = x + 1.


Now suppose that (⋆) is true for n = 1, 2, ..., x for some x.

Thus we can assume that $\sum_{k=1}^{x} k = \frac{x(x+1)}{2}$. (a)

((a) for ‘allowed’ info)

We need to show that if (⋆) holds for n = x, then it must also hold for n = x + 1.

Require to prove that

$$\sum_{k=1}^{x+1} k = \frac{(x+1)(x+2)}{2} \qquad (⋆⋆)$$

(Obtained by putting n = x + 1 in (⋆)).

$$\text{LHS of } (⋆⋆) = \sum_{k=1}^{x+1} k$$
$$= \sum_{k=1}^{x} k + (x+1) \quad \text{by expanding the sum}$$
$$= \frac{x(x+1)}{2} + (x+1) \quad \text{using allowed info (a)}$$
$$= (x+1)\left( \frac{x}{2} + 1 \right) \quad \text{rearranging}$$
$$= \frac{(x+1)(x+2)}{2}$$
$$= \text{RHS of } (⋆⋆).$$


This shows that:

$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2} \quad \text{when } n = x + 1.$$

So, assuming (⋆) is true for n = x, it is also true for n = x + 1.


(iii) Refer back to the base case: if it is true for n = 1, then it is true for n = 1 + 1 = 2
by (ii). If it is true for n = 2, it is true for n = 2 + 1 = 3 by (ii). We could go
on forever. This proves that the formula (⋆) is true for all n.


We proved (⋆) true for n = 1, thus (⋆) is true for all integers n ≥ 1. □
General procedure for proof by induction


The procedure above is quite standard. The inductive proof can be summarized
like this:


Question: prove that f(n) = g(n) for all integers n ≥ 1. (⋆)


Base case: n = 1. Prove that f(1) = g(1) using
    LHS = f(1)
        = ...
        = g(1) = RHS.

General case: suppose (⋆) is true for n = x:
    so f(x) = g(x). (a) (allowed info)

Prove that (⋆) is therefore true for n = x + 1:

    RTP f(x + 1) = g(x + 1). (⋆⋆)

    LHS(⋆⋆) = f(x + 1)
            = { some expression breaking down f(x + 1)
                into f(x) and an extra term in x + 1 }
            = { substitute f(x) = g(x) in the line above }   by allowed (a)
            = { do some working }
            = g(x + 1)
            = RHS(⋆⋆).


Conclude: (⋆) is proved for n = 1, so it is proved for n = 2, n = 3, and so on.


(⋆) is therefore proved for all integers n ≥ 1. □
5.3 Some harder examples of mathematical induction


Induction problems in stochastic processes are often trickier than usual. Here
are some possibilities:


• Backwards induction: start with base case n = N and go backwards,
instead of starting at base case n = 1 and going forwards.


• Two-step induction, where the proof for n = x + 1 relies not only on the
formula being true for n = x, but also on it being true for n = x − 1.


The first example below is hard probably because it is too easy. The second
example is an example of a two-step induction.


Example 1: Suppose that $p_0 = 1$ and $p_x = \alpha p_{x+1}$ for all x = 0, 1, 2, .... Prove by
mathematical induction that $p_n = 1/\alpha^n$ for n = 0, 1, 2, ....


Wish to prove

$$p_n = \frac{1}{\alpha^n} \quad \text{for } n = 0, 1, 2, \ldots \qquad (⋆)$$


Information given:

$$p_{x+1} = \frac{1}{\alpha}\,p_x \qquad (G_1)$$
$$p_0 = 1 \qquad (G_2)$$

Base case: n = 0.

LHS = $p_0 = 1$ by information given $(G_2)$.

RHS = $\frac{1}{\alpha^0} = \frac{1}{1} = 1$ = LHS.

Therefore (⋆) is true for the base case n = 0.


General case: suppose that (⋆) is true for n = x, so we can assume

$$p_x = \frac{1}{\alpha^x}. \qquad (a)$$

Wish to prove that (⋆) is also true for n = x + 1: i.e.

$$\text{RTP} \quad p_{x+1} = \frac{1}{\alpha^{x+1}}. \qquad (⋆⋆)$$

$$\text{LHS of } (⋆⋆) = p_{x+1} = \frac{1}{\alpha} \times p_x \quad \text{by given } (G_1)$$
$$= \frac{1}{\alpha} \times \frac{1}{\alpha^x} \quad \text{by allowed (a)}$$
$$= \frac{1}{\alpha^{x+1}}$$
$$= \text{RHS of } (⋆⋆).$$


So if formula (⋆) is true for n = x, it is true for n = x + 1. We have shown it is
true for n = 0, so it is true for all n = 0, 1, 2, .... □


Example 2: Gambler’s Ruin. In the Gambler’s Ruin problem in Section 2.7,
we have the following situation:

• $p_x = P(\text{Ruin} \mid \text{start with } \$x)$;
• We know from first-step analysis that $p_{x+1} = 2p_x - p_{x-1}$ $(G_1)$
• We know from common sense that $p_0 = 1$ $(G_2)$
• By direct substitution into $(G_1)$, we obtain:

$$p_2 = 2p_1 - 1$$
$$p_3 = 3p_1 - 2$$

• We develop a suspicion that for all x = 1, 2, 3, ...,

$$p_x = x p_1 - (x - 1) \qquad (⋆)$$

• We wish to prove (⋆) by mathematical induction.


For this example, our given information, in $(G_1)$, expresses $p_{x+1}$ in terms of both
$p_x$ and $p_{x-1}$, so we need two base cases. Use x = 1 and x = 2.


Wish to prove $p_x = x p_1 - (x - 1)$ (⋆).

Base case x = 1:
LHS = $p_1$.
RHS = $1 \times p_1 - 0 = p_1$ = LHS.
∴ formula (⋆) is true for base case x = 1.

Base case x = 2:
LHS = $p_2 = 2p_1 - 1$ by information given $(G_1)$ and $(G_2)$.
RHS = $2 \times p_1 - 1$ = LHS.
∴ formula (⋆) is true for base case x = 2.


General case: suppose that (⋆) is true for all x up to x = k.
So we are allowed:

$$(x = k) \qquad p_k = k p_1 - (k - 1) \qquad (a_1)$$
$$(x = k - 1) \qquad p_{k-1} = (k - 1) p_1 - (k - 2) \qquad (a_2)$$


Wish to prove that (⋆) is also true for x = k + 1, i.e.

$$\text{RTP} \quad p_{k+1} = (k + 1) p_1 - k. \qquad (⋆⋆)$$

$$\text{LHS of } (⋆⋆) = p_{k+1}$$
$$= 2p_k - p_{k-1} \quad \text{by given information } (G_1)$$
$$= 2\left\{ k p_1 - (k - 1) \right\} - \left\{ (k - 1) p_1 - (k - 2) \right\} \quad \text{by allowed } (a_1) \text{ and } (a_2)$$
$$= p_1 \left\{ 2k - (k - 1) \right\} - \left\{ 2(k - 1) - (k - 2) \right\}$$
$$= (k + 1) p_1 - k$$
$$= \text{RHS of } (⋆⋆).$$


So if formula (⋆) is true for x = k − 1 and x = k, it is true for x = k + 1. We have
shown it is true for x = 1 and x = 2, so it is true for all x = 1, 2, 3, .... □
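The induction result can be spot-checked by iterating the recurrence (G1) exactly with rational arithmetic. In the sketch below the value p_1 = 19/20 and the function name are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def iterate_recurrence(p1, x_max):
    """Return [p_0, ..., p_{x_max}] from p_0 = 1 and p_{x+1} = 2 p_x - p_{x-1}."""
    p = [Fraction(1), Fraction(p1)]
    for x in range(1, x_max):
        p.append(2 * p[x] - p[x - 1])
    return p

p1 = Fraction(19, 20)        # hypothetical value of p_1, for illustration only
p = iterate_recurrence(p1, 20)
# Compare every iterated value with the closed form x * p_1 - (x - 1).
print(all(p[x] == x * p1 - (x - 1) for x in range(1, 21)))
print(p[20])
```

With this choice of p_1 the closed form gives p_20 = 20 × 19/20 − 19 = 0, and the iterated value agrees exactly. As with any finite check, this supports the formula but does not replace the proof.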
Chapter 6: Branching Processes: The Theory of Reproduction


[Images: Viruses; Royalty.]


Although the early development of Probability Theory was motivated by problems
in gambling, probabilists soon realised that, if they were to continue as a
breed, they must also study reproduction.


Reproduction is a complicated business, but considerable insights into population
growth can be gained from simplified models. The Branching Process is a simple
but elegant model of population growth. It is also called the Galton-Watson
Process, because some of the early theoretical results about the process derive
from a correspondence between Sir Francis Galton and the Reverend Henry William
Watson in 1873. Francis Galton was a cousin of Charles Darwin. In later life, he
developed some less elegant ideas about reproduction: namely eugenics, or
selective breeding of humans. Luckily he is better remembered for branching processes.


6.1 Branching Processes


Consider some sort of population consisting of reproducing individuals.


Examples: living things (animals, plants, bacteria, royal families);
diseases; computer viruses;
rumours, gossip, lies (one lie always leads to another!)


Start conditions: start at time n = 0, with a single individual.


Each individual: lives for 1 unit of time. At time n = 1, it produces a family of
offspring, and immediately dies.


How many offspring? Could be 0, 1, 2, .... This is the family size, Y. (“Y”
stands for “number of Young”).


Each offspring: lives for 1 unit of time. At time n = 2, it produces its own family
of offspring, and immediately dies.


and so on...


Assumptions
1. All individuals reproduce independently of each other.
2. The family sizes of different individuals are independent, identically
distributed random variables. Denote the family size by Y (number of Young).


Family size distribution, Y: $P(Y = k) = p_k$, for k = 0, 1, 2, ....




Definition: A branching process is defined as follows.


• Single individual at time n = 0.

• Every individual lives exactly one unit of time, then produces Y offspring,
and dies.

• The number of offspring, Y, takes values 0, 1, 2, ..., and the probability
of producing k offspring is $P(Y = k) = p_k$.

• All individuals reproduce independently. Individuals 1, 2, ..., n have family
sizes $Y_1, Y_2, \ldots, Y_n$, where each $Y_i$ has the same distribution as Y.

• Let $Z_n$ be the number of individuals born at time n, for n = 0, 1, 2, ....
Interpret ‘$Z_n$’ as the ‘siZe’ of generation n.

• Then the branching process is $\{Z_0, Z_1, Z_2, Z_3, \ldots\} = \{Z_n : n \in \mathbb{N}\}$.


Definition: The state of the branching process at time n is $z_n$, where each $z_n$ can
take values 0, 1, 2, 3, .... Note that $z_0 = 1$ always.
$z_n$ represents the size of the population at time n.


Note: When we want to say that two random variables X and Y have the same
distribution, we write: X ∼ Y.
For example: $Y_i \sim Y$, where $Y_i$ is the family size of any individual i.


Note: The definition of the branching process is easily generalized to start with
more than one individual at time n = 0.


[Figure: Branching Process]
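The definition translates directly into a short simulation. The sketch below is illustrative only: the function name, the example family-size distribution (0.25, 0.5, 0.25) and the seed are assumptions, not taken from the notes.

```python
import random

def simulate_branching(offspring_probs, n_generations, seed=None):
    """Return [Z_0, Z_1, ..., Z_n], where P(Y = k) = offspring_probs[k]."""
    rng = random.Random(seed)
    family_sizes = list(range(len(offspring_probs)))
    sizes = [1]                      # Z_0 = 1: a single individual at time 0
    for _ in range(n_generations):
        parents = sizes[-1]
        # Each parent independently draws a family size Y_i with the same distribution.
        families = rng.choices(family_sizes, weights=offspring_probs, k=parents)
        sizes.append(sum(families))  # the next generation is the total offspring
    return sizes

print(simulate_branching([0.25, 0.5, 0.25], 10, seed=3))
```

Once a generation has size 0 the population stays at 0, because there are no parents left to reproduce. Running the function with different seeds gives a feel for how variable the generation sizes are.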
6.2 Questions about the Branching Process


When we have a situation that can be modelled by a branching process, there
are several questions we might want to answer.


If the branching process is just beginning, what will happen in the future?

1. What can we find out about the distribution of $Z_n$ (the population siZe at
generation n)?

• can we find the mean and variance of $Z_n$?
- yes, using the probability generating function of family size, Y;

• can we find the whole distribution of $Z_n$?
- for special cases of the family size distribution Y, we can find the PGF of
$Z_n$ explicitly;

• can we find the probability that the population has become extinct by
generation n, $P(Z_n = 0)$?
- for special cases where we can find the PGF of $Z_n$ (as above).


2. What can we find out about eventual extinction?

• can we find the probability of eventual extinction, $P\left( \lim_{n \to \infty} Z_n = 0 \right)$?
- yes, always: using the PGF of Y.

• can we find general conditions for eventual extinction?
- yes: we can find conditions that guarantee that extinction will occur with
probability 1.

• if eventual extinction is definite, can we find the distribution of the time to
extinction?
- for special cases where we can find the PGF of $Z_n$ (as above).


Example: Modelling cancerous growths. Will a colony of cancerous cells become
extinct before it is sufficiently large to overgrow the surrounding tissue?


If the branching process is already in progress, what happened in the past?


1. How long has the process been running?
• how many generations do we have to go back to get to the single common
ancestor?


2. What has been the distribution of family size over the generations?


3. What is the total number of individuals (over all generations) up to the present
day?


Example: It is believed that all humans are descended from a single female
ancestor, who lived in Africa. How long ago?
- estimated at approximately 200,000 years.

What has been the mean family size over that period?
- probably very close to 1 female offspring per female adult: e.g. estimate = 1.002.


6.3 Analysing the Branching Process 


Key Observation: every individual in every generation starts a new, independent 
branching process, as if the whole process were starting at the beginning again. 


$Z_n$ as a randomly stopped sum


Most of the interesting properties of the branching process centre on the
distribution of $Z_n$ (the population size at time n). Using the Key Observation from
overleaf, we can find an expression for the probability generating function of
$Z_n$.


Consider the following.


• The population size at time n − 1 is given by $Z_{n-1}$.

• Label the individuals at time n − 1 as $1, 2, 3, \ldots, Z_{n-1}$.

• Each individual $1, 2, \ldots, Z_{n-1}$ starts a new branching process. Let $Y_1, Y_2, \ldots, Y_{Z_{n-1}}$
be the random family sizes of the individuals $1, 2, \ldots, Z_{n-1}$.

• The number of individuals at time n, $Z_n$, is equal to the total number of
offspring of the individuals $1, 2, \ldots, Z_{n-1}$. That is,

$$Z_n = \sum_{i=1}^{Z_{n-1}} Y_i.$$


Thus $Z_n$ is a randomly stopped sum: a sum of $Y_1, Y_2, \ldots$, randomly stopped
by the random variable $Z_{n-1}$.


Note: 1. Each $Y_i \sim Y$: that is, each individual $i = 1, \ldots, Z_{n-1}$ has the same
family size distribution.

2. $Y_1, Y_2, \ldots, Y_{Z_{n-1}}$ are independent.




Probability Generating Function of $Z_n$


Let $G_Y(s) = E(s^Y)$ be the probability generating function of Y.
(Recall that Y is the number of Young of an individual: the family size.)


Now $Z_n$ is a randomly stopped sum: it is the sum of $Y_1, Y_2, \ldots$, stopped by the
random variable $Z_{n-1}$. So we can use Theorem 4.6 (Chapter 4) to express the
PGF of $Z_n$ directly in terms of the PGFs of Y and $Z_{n-1}$.


By Theorem 4.6, if $Z_n = Y_1 + Y_2 + \ldots + Y_{Z_{n-1}}$, and $Z_{n-1}$ is itself random, then
the PGF of $Z_n$ is given by:

$$G_{Z_n}(s) = G_{Z_{n-1}}\left( G_Y(s) \right), \qquad (\clubsuit)$$

where $G_{Z_{n-1}}$ is the PGF of the random variable $Z_{n-1}$.


For ease of notation, we can write:

$G_{Z_n}(s) = G_n(s)$, $G_{Z_{n-1}}(s) = G_{n-1}(s)$, and so on.


Note that $Z_1 = Y$ (the number of individuals born at time n = 1),
so we can also write:

$G_Y(s) = G_1(s) = G(s)$ (for simplicity).

Thus, from $(\clubsuit)$,

$$G_n(s) = G_{n-1}\left( G(s) \right) \qquad \text{(Branching Process Recursion Formula.)}$$


Note:
1. $G_n(s) = E(s^{Z_n})$, the PGF of the population size at time n, $Z_n$.

2. $G_{n-1}(s) = E(s^{Z_{n-1}})$, the PGF of the population size at time n − 1, $Z_{n-1}$.

3. $G(s) = E(s^Y) = E(s^{Z_1})$, the PGF of the family size, Y.


We are trying to find the PGF of $Z_n$, the population size at time n.
So far, we have:

$$G_n(s) = G_{n-1}\left( G(s) \right). \qquad (⋆)$$

But by the same argument,

$$G_{n-1}(r) = G_{n-2}\left( G(r) \right).$$

(use r instead of s to avoid confusion in the next line.)
Substituting in (⋆),

$$G_n(s) = G_{n-1}\left( G(s) \right)$$
$$= G_{n-1}(r) \quad \text{where } r = G(s)$$
$$= G_{n-2}\left( G(r) \right)$$
$$= G_{n-2}\left( G(G(s)) \right) \quad \text{replacing } r = G(s).$$


By the same reasoning, we will obtain:

$$G_n(s) = G_{n-3}\Big( \underbrace{G(G(G(s)))}_{3 \text{ times}} \Big),$$

and so on, until we finally get:

$$G_n(s) = G_{n-(n-1)}\Big( \underbrace{G(G(G(\ldots G(s) \ldots)))}_{n-1 \text{ times}} \Big)$$
$$= G_1\Big( \underbrace{G(G(G(\ldots G(s) \ldots)))}_{n-1 \text{ times}} \Big)$$
$$= \underbrace{G(G(G(\ldots G(s) \ldots)))}_{n \text{ times}}.$$


We have therefore proved the following Theorem.


Theorem 6.3: Let $G(s) = E(s^Y) = \sum_{y=0}^{\infty} p_y s^y$ be the PGF of the family size
distribution, Y. Let $Z_0 = 1$ (start from a single individual at time 0), and let
$Z_n$ be the population size at time n (n = 0, 1, 2, ...). Let $G_n(s)$ be the PGF of
the random variable $Z_n$. Then

$$G_1(s) = G(s),$$
$$G_2(s) = G(G(s)),$$
$$G_3(s) = G(G(G(s))),$$
$$\vdots$$
$$G_n(s) = \underbrace{G(G(G(\ldots G(s) \ldots)))}_{n \text{ times}}.$$


Note: $G_n(s) = \underbrace{G(G(G(\ldots G(s) \ldots)))}_{n \text{ times}}$ is called the n-fold iterate of G.


6.4 


—— ee 
n times 


We have therefore found an expression for the PGF of the population size at 
generation n, although there is no guarantee that it is possible to write it down 
or manipulate it very easily for large n. For example, if Y has a Poisson(A) 
distribution, then G(s) = e**—)), and already by generation n = 3 we have the 
following fearsome expression for G'3(s): 


G(s) = , (oe) 1) | 


(Or something like that!) 


However, in some circumstances we can find quite reasonable closed-form ex- 
pressions for G’,(s), notably when Y has a Geometric distribution. In addition, 


for any distribution of Y we can use the expression G,,(s) = Gn1(G(s)) to 


derive properties such as the mean and variance of Z,,, and the probability of 
eventual extinction (P(Z,, = 0) for some n). 


What does the distribution of Z,, look like? 


Before deriving the mean and the variance of $Z_n$, it is helpful to get some
intuitive idea of how the branching process behaves. For example, it seems
reasonable to calculate the mean, $E(Z_n)$, to find out what we expect the population
size to be in n generations time, but why are we interested in $\text{Var}(Z_n)$?


The answer is that $Z_n$ usually has a “boom-or-bust” distribution: either the
population will take off (boom), and the population size grows quickly, or the
population will fail altogether (bust). In fact, if the population fails, it is likely
to do so very quickly, within the first few generations. This explains why we are
interested in $\text{Var}(Z_n)$. A huge variance will alert us to the fact that the process
does not cluster closely around its mean values. In fact, the mean might be
almost useless as a measure of what to expect from the process.


Simulation 1: Y ~ Geometric(p = 0.3) 


The following table shows the results from 10 simulations of a branching process, 
where the family size distribution is Y ~ Geometric(p = 0.3). 


Simulation Z0 Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 Z10


1 1 0 0 0 0 0 0 0 0 0 0
2 LL fi 2 @ 0 0 0 0 0 0 
3 1 4 19 42 81 181 483 964 2276 5383 12428 
zi 1 3 38 5 38 15 29 86 207 435 952 
5 L 2 © & @ 0 0 0 0 0 0 
6 tL + @ 2 @ 0 0 0 0 0 0 
7 1 2 8 26 68 162 360 845 2039 4746 10941 
8 i if & 2 0 0 0 0 0 0 0 
u i i «@ 0 0 0 0 0 0 
10 1 1 4 13 18 39 104 294 690 1566 3534


Often, the population is extinct by generation 10. However, when it is not 
extinct, it can take enormous values (12428, 10941, ...). 
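
A table like the one above can be reproduced with a short simulation. The sketch below is only an illustration, not the code used for these notes: the function names and the seed are hypothetical, and it assumes the convention used in this course that Y ~ Geometric(p) takes values 0, 1, 2, ... with P(Y = y) = p q^y, so that E(Y) = q/p.

```python
import math
import random

def geometric_family_size(p, rng):
    """Draw Y with P(Y = y) = p * (1 - p)**y for y = 0, 1, 2, ... (mean q/p)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    # Inverse transform: P(Y >= y) = P(u <= q**y) = q**y.
    return int(math.log(u) / math.log(1.0 - p))

def simulate_branching(p, generations, rng):
    """Return [Z_0, Z_1, ..., Z_generations], starting from Z_0 = 1."""
    z = [1]
    for _ in range(generations):
        # Each of the Z_{n-1} current individuals has its own family size.
        z.append(sum(geometric_family_size(p, rng) for _ in range(z[-1])))
    return z

rng = random.Random(325)  # arbitrary seed, chosen only for reproducibility
for sim in range(1, 11):
    print(sim, simulate_branching(0.3, 10, rng))
```

Each printed row is one trajectory Z_0, ..., Z_10. Once a trajectory hits 0 it stays at 0, because the sum over zero individuals is empty.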


The same simulation was repeated 5000 times to find the empirical distribution
of the population size at generation 10 ($Z_{10}$). The figures below show
the distribution of family size, Y, and the distribution of $Z_{10}$ from the 5000
simulations.

[Figure: two histograms. Left panel: Family Size, Y (x-axis: family size, 0 to 30). Right panel: Z10 (x-axis: Z10, 0 to 60000).]
In this example, the family size is rather variable, but the variability in $Z_{10}$ is
enormous (note the range on the histogram from 0 to 60,000). Some statistics
are:


Proportion of samples extinct by generation 10: 0.436


Summary of Zn:

Min   1st Qu   Median   Mean   3rd Qu   Max
0     0        1003     4617   6656     82486

Mean of Zn: 4617.2

Variance of Zn: 53937785.7


So the empirical variance is $\text{Var}(Z_{10}) = 5.39 \times 10^7$. This perhaps contains
more useful information than the mean value of 4617. The distribution of $Z_n$
has 43.6% of zeros, but (when it is non-zero) takes values up to 82,486. Is it
really useful to summarize such a distribution by the single mean value 4617?


For interest, out of the 5000 simulations, there were only 35 (0.7%) that had a
value for $Z_{10}$ greater than 0 but less than 100. This emphasizes the
“boom-or-bust” nature of the distribution of $Z_n$.


Simulation 2: Y ~ Geometric(p = 0.5) 


We repeat the simulation above with a different value for p in the Geometric 
family size distribution: this time, p = 0.5. The family size distribution is 
therefore Y ~ Geometric(p = 0.5). 


Simulation Z0 Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 Z10

[Table: population sizes for simulations 1 to 10; the entries were not preserved.]
This time, almost all the populations become extinct. We will see later that
this value of p (just) guarantees eventual extinction with probability 1.


The family size distribution, Y ~ Geometric(p = 0.5), and the results for
$Z_{10}$ from 5000 simulations, are shown below. Family sizes are often zero, but
families of size 2 and 3 are not uncommon. It seems that this is not enough
to save the process from extinction. This time, the maximum population size
observed for $Z_{10}$ from 5000 simulations was only 56, and the mean and variance
of $Z_{10}$ are much smaller than before.


[Figure: two histograms. Left panel: Family Size, Y (x-axis: family size, 0 to 15). Right panel: Z10 (x-axis: Z10, 0 to 60).]


Proportion of samples extinct by generation 10: 0.9108 


Summary of Zn:

Min   1st Qu   Median   Mean    3rd Qu   Max
0     0        0        0.965   0        56


Mean of Zn: 0.965 
Variance of Zn: 19.497 


What happens for larger values of p? 


It was mentioned above that Y ~ Geometric(p = 0.5) just guarantees eventual 
extinction with probability 1. For p > 0.5, extinction is also guaranteed, and 
tends to happen quickly. For example, when p = 0.55, over 97% of simulated 
populations are already extinct by generation 10. 


6.5 Mean and variance of $Z_n$


The previous section has given us a good idea of the significance and
interpretation of $E(Z_n)$ and $\text{Var}(Z_n)$. We now proceed to calculate them. Both $E(Z_n)$
and $\text{Var}(Z_n)$ can be expressed in terms of the mean and variance of the family
size distribution, Y.


Thus, let $E(Y) = \mu$ and let $\text{Var}(Y) = \sigma^2$. These are the mean and variance of the
number of offspring of a single individual.


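Theorem 6.5: Let $\{Z_0, Z_1, Z_2, \ldots\}$ be a branching process with $Z_0 = 1$ (start with
a single individual). Let Y denote the family size distribution, and suppose that
$E(Y) = \mu$. Then

$$E(Z_n) = \mu^n.$$

Proof:
By page 121, $Z_n = Y_1 + Y_2 + \ldots + Y_{Z_{n-1}}$ is a randomly stopped sum:

$$Z_n = \sum_{i=1}^{Z_{n-1}} Y_i.$$

Thus, from Section 3.4 (page 62),

$$E(Z_n) = E(Y_i) \times E(Z_{n-1})$$
$$= \mu \times E(Z_{n-1})$$
$$= \mu \{\mu E(Z_{n-2})\}$$
$$= \mu^2 E(Z_{n-2})$$
$$= \ldots$$
$$= \mu^{n-1} E(Z_1)$$
$$= \mu^{n-1} \times \mu$$
$$= \mu^n. \qquad \square$$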
Examples: Consider the simulations of Section 6.4.

1. Family size Y ~ Geometric(p = 0.3). So $\mu = E(Y) = \frac{q}{p} = \frac{0.7}{0.3} = 2.33$.
Expected population size by generation n = 10 is:

$$E(Z_{10}) = \mu^{10} = (2.33)^{10} = 4784.$$

The theoretical value, 4784, compares well with the sample mean from 5000
simulations, 4617 (page 126).


2. Family size Y ~ Geometric(p = 0.5). So $\mu = E(Y) = \frac{q}{p} = \frac{0.5}{0.5} = 1$, and

$$E(Z_{10}) = \mu^{10} = (1)^{10} = 1.$$

Compares well with the sample mean of 0.965 (page 127).


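Variance of $Z_n$


Theorem 6.5: Let $\{Z_0, Z_1, Z_2, \ldots\}$ be a branching process with $Z_0 = 1$ (start with
a single individual). Let Y denote the family size distribution, and suppose that
$E(Y) = \mu$ and $\text{Var}(Y) = \sigma^2$. Then

$$\text{Var}(Z_n) = \begin{cases} \sigma^2 n & \text{if } \mu = 1, \\[4pt] \sigma^2 \mu^{n-1} \left( \dfrac{1 - \mu^n}{1 - \mu} \right) & \text{if } \mu \neq 1 \ (\mu > 1 \text{ or } \mu < 1). \end{cases}$$

Proof:


Write $V_n = \text{Var}(Z_n)$. The proof works by finding a recursive formula for $V_n$.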
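Using the Law of Total Variance for randomly stopped sums from Section 3.4
(page 62),

$$Z_n = \sum_{i=1}^{Z_{n-1}} Y_i$$
$$\Rightarrow \text{Var}(Z_n) = \{E(Y_i)\}^2 \times \text{Var}(Z_{n-1}) + \text{Var}(Y_i) \times E(Z_{n-1})$$
$$\Rightarrow V_n = \mu^2 V_{n-1} + \sigma^2 E(Z_{n-1})$$
$$\Rightarrow V_n = \mu^2 V_{n-1} + \sigma^2 \mu^{n-1},$$

using $E(Z_{n-1}) = \mu^{n-1}$ as above.


Also,
$$V_1 = \text{Var}(Z_1) = \text{Var}(Y) = \sigma^2.$$

Find $V_n$ by repeated substitution:

$$V_1 = \sigma^2$$
$$V_2 = \mu^2 V_1 + \sigma^2 \mu = \mu^2 \sigma^2 + \mu \sigma^2 = \mu \sigma^2 (1 + \mu)$$
$$V_3 = \mu^2 V_2 + \sigma^2 \mu^2 = \mu^2 \sigma^2 (1 + \mu + \mu^2)$$
$$V_4 = \mu^2 V_3 + \sigma^2 \mu^3 = \mu^3 \sigma^2 (1 + \mu + \mu^2 + \mu^3)$$
$$\vdots$$

Completing the pattern,

$$V_n = \mu^{n-1} \sigma^2 (1 + \mu + \mu^2 + \ldots + \mu^{n-1})$$
$$= \mu^{n-1} \sigma^2 \sum_{r=0}^{n-1} \mu^r$$
$$= \mu^{n-1} \sigma^2 \left( \frac{1 - \mu^n}{1 - \mu} \right). \qquad \text{Valid for } \mu \neq 1.$$

(sum of first n terms of Geometric series)

When $\mu = 1$:

$$V_n = 1^{n-1} \sigma^2 \underbrace{(1^0 + 1^1 + \ldots + 1^{n-1})}_{n \text{ times}} = \sigma^2 n.$$

Hence the result:

$$\text{Var}(Z_n) = \begin{cases} \sigma^2 n & \text{if } \mu = 1, \\[4pt] \sigma^2 \mu^{n-1} \left( \dfrac{1 - \mu^n}{1 - \mu} \right) & \text{if } \mu \neq 1. \end{cases} \qquad \square$$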

Examples: Again consider the simulations of Section 6.4.


1. Family size Y ~ Geometric(p = 0.3). So $\mu = E(Y) = \frac{q}{p} = \frac{0.7}{0.3} = 2.33$, and
$\sigma^2 = \text{Var}(Y) = \frac{q}{p^2} = \frac{0.7}{(0.3)^2} = 7.78$. Thus

$$\text{Var}(Z_{10}) = \sigma^2 \mu^{9} \left( \frac{1 - \mu^{10}}{1 - \mu} \right) = 5.72 \times 10^7.$$

Compares well with the sample variance from 5000 simulations, $5.39 \times 10^7$
(page 126).


2. Family size Y ~ Geometric(p = 0.5). So $\mu = E(Y) = \frac{q}{p} = \frac{0.5}{0.5} = 1$, and
$\sigma^2 = \text{Var}(Y) = \frac{q}{p^2} = \frac{0.5}{(0.5)^2} = 2$. Using the formula for $\text{Var}(Z_n)$ when $\mu = 1$, we
have:

$$\text{Var}(Z_{10}) = \sigma^2 n = 2 \times 10 = 20.$$

Compares well with the sample variance of 19.5 (page 127).
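
The arithmetic in these examples can be checked with a few lines of code. This is a minimal sketch under the assumptions of this section ($Z_0 = 1$, Geometric family size with mean q/p and variance q/p²); the function names are illustrative only.

```python
def mean_zn(mu, n):
    """E(Z_n) = mu**n for a branching process started from Z_0 = 1."""
    return mu ** n

def var_zn(mu, sigma2, n):
    """Var(Z_n): sigma^2 * n if mu = 1, else sigma^2 mu^(n-1) (1 - mu^n)/(1 - mu)."""
    if mu == 1:
        return sigma2 * n
    return sigma2 * mu ** (n - 1) * (1 - mu ** n) / (1 - mu)

def geometric_moments(p):
    """Mean q/p and variance q/p^2 of Y ~ Geometric(p) on 0, 1, 2, ..."""
    q = 1 - p
    return q / p, q / p ** 2

for p in (0.3, 0.5):
    mu, sigma2 = geometric_moments(p)
    print(p, mean_zn(mu, 10), var_zn(mu, sigma2, 10))
```

For p = 0.3 the printed values should be close to the 4784 and 5.72 × 10^7 quoted above, and for p = 0.5 they should be 1 and 20.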
Revision: a branching process consists of reproducing individuals.


• All individuals are independent.
• Start with a single individual at time 0: $Z_0 = 1$.
• Each individual lives a single unit of time, then has Y offspring and dies.
• Let $Z_n$ be the siZe of generation n: the number of individuals born at
time n.
• The branching process is $\{Z_0 = 1, Z_1, Z_2, \ldots\}$.


Branching Process Recursion Formula


This is the fundamental formula for branching processes. Let $G_n(s) = E(s^{Z_n})$
be the PGF of $Z_n$, the population size at time n. Let $G(s) = G_1(s)$, the PGF
of the family size distribution Y, or equivalently, of $Z_1$. Then:

$$G_n(s) = G_{n-1}\big(G(s)\big) = \underbrace{G\big(G(G(\ldots G(s) \ldots))\big)}_{n \text{ times}}.$$


7.1 Extinction Probability 


One of the most interesting applications of branching processes is calculating 
the probability of eventual extinction. For example, what is the probability 
that a colony of cancerous cells becomes extinct before it overgrows the sur- 
rounding tissue? What is the probability that an infectious disease dies out 
before reaching an epidemic? What is the probability that a family line (e.g. 
for royal families) becomes extinct? 


It is possible to find several results about the probability of eventual extinction. 


Extinction by generation n

The population is extinct by generation n if $Z_n = 0$
(no individuals at time n).

If $Z_n = 0$, then the population is extinct
for ever: $Z_t = 0$ for all $t \geq n$.

Extinction is Forever


Definition: Define event $E_n$ to be the event
$E_n = \{Z_n = 0\}$ (event that the population is extinct by generation n).


Note: $E_0 \subseteq E_1 \subseteq E_2 \subseteq E_3 \subseteq E_4 \subseteq \ldots$
This is because event $E_i$ forces $E_j$ to be true for all $j \geq i$, so $E_i$ is a ‘part’ or
subset of $E_j$ for $j \geq i$.


Ultimate extinction 


At the start of the branching process, we are interested in the probability of
ultimate extinction: the probability that the population will be extinct by generation
n, for any value of n.


We can express this probability in different ways:

$$P(\text{ultimate extinction}) = P\left( \bigcup_{n=0}^{\infty} E_n \right)$$

i.e. extinct by generation 0 or extinct by generation 1 or extinct by generation 2 or ...

Or:

$$P(\text{ultimate extinction}) = P\left( \lim_{n \to \infty} E_n \right) \quad \text{(i.e. P(extinct by generation } \infty)).$$

Note: By the Continuity Theorem (Chapter 2), and because $E_0 \subseteq E_1 \subseteq E_2 \subseteq \ldots$,
we have:

$$P(\text{ultimate extinction}) = P\left( \lim_{n \to \infty} E_n \right) = \lim_{n \to \infty} P(E_n).$$

Thus the probability of eventual extinction is the limit as $n \to \infty$ of the
probability of extinction by generation n.


We will use the Greek letter Gamma ($\gamma$) for the probability of extinction: think
of Gamma for ‘all Gone’!

$$\gamma_n = P(E_n) = P(\text{extinct by generation } n).$$

$$\gamma = P(\text{ultimate extinction}).$$

By the Note above, we have established that we are looking for:

$$P(\text{ultimate extinction}) = \gamma = \lim_{n \to \infty} \gamma_n.$$


Extinction is Forever 


Theorem 7.1: Let $\gamma$ be the probability of ultimate extinction. Then
$\gamma$ is the smallest non-negative solution of the equation
$G(s) = s$, where $G(s)$ is the PGF of the family size distribution, Y.


To find the probability of ultimate extinction, we therefore:

• find the PGF of family size, Y: $G(s) = E(s^Y)$;
• find values of s that satisfy $G(s) = s$;
• find the smallest of these values that is $\geq 0$. This is the required value $\gamma$.

$G(\gamma) = \gamma$, and $\gamma$ is the smallest value $\geq 0$ for which this holds.


Note: Recall that, for any (non-defective) random variable Y with PGF $G(s)$,

$$G(1) = E(1^Y) = \sum_y 1^y P(Y = y) = \sum_y P(Y = y) = 1.$$

So $G(1) = 1$ always, and therefore there always exists a solution for $G(s) = s$
in [0, 1].


The required value $\gamma$ is the smallest such solution $\geq 0$.


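Before proving Theorem 7.1 we prove the following Lemma.

Lemma: Let $\gamma_n = P(Z_n = 0)$. Then $\gamma_n = G(\gamma_{n-1})$.

Proof: If $G_n(s)$ is the PGF of $Z_n$, then $P(Z_n = 0) = G_n(0)$. (Chapter 4.)

So $\gamma_n = G_n(0)$. Similarly, $\gamma_{n-1} = G_{n-1}(0)$.

Now

$$G_n(0) = \underbrace{G\big(G(G(\ldots G(0) \ldots))\big)}_{n \text{ times}} = G\big(G_{n-1}(0)\big),$$

so $\gamma_n = G\big(G_{n-1}(0)\big) = G(\gamma_{n-1})$. $\square$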


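Proof of Theorem 7.1: We need to prove:

(i) $G(\gamma) = \gamma$;

(ii) $\gamma$ is the smallest non-negative value for which $G(\gamma) = \gamma$.
That is, if $s \geq 0$ and $G(s) = s$, then $\gamma \leq s$.


Proof of (i):

Recall that $\gamma$ is the limit of $\gamma_n$ as $n \to \infty$. Thus

$$\gamma = \lim_{n \to \infty} \gamma_n = \lim_{n \to \infty} G(\gamma_{n-1}) \quad \text{(by Lemma)}$$
$$= G\left( \lim_{n \to \infty} \gamma_{n-1} \right) \quad \text{(G is continuous)}$$
$$= G(\gamma).$$

So $G(\gamma) = \gamma$, as required.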
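Proof of (ii):


First note that $G(s)$ is an increasing function on [0, 1]:

$$G(s) = E(s^Y) = \sum_{y=0}^{\infty} s^y P(Y = y)$$
$$\Rightarrow G'(s) = \sum_{y=0}^{\infty} y s^{y-1} P(Y = y)$$
$$\Rightarrow G'(s) \geq 0 \text{ for } 0 \leq s \leq 1, \text{ so G is increasing on } [0, 1].$$

$G(s)$ is increasing on [0, 1] means that:

$$s_1 \leq s_2 \Rightarrow G(s_1) \leq G(s_2) \text{ for any } s_1, s_2 \in [0, 1]. \qquad (\spadesuit)$$

The branching process begins with $Z_0 = 1$, so
$$P(\text{extinct by generation 0}) = \gamma_0 = 0.$$
At any later generation, $\gamma_n = G(\gamma_{n-1})$ by Lemma.

Now suppose that $s \geq 0$ and $G(s) = s$. Then we have:

$$0 \leq s \Rightarrow \gamma_0 \leq s \quad \text{(because } \gamma_0 = 0)$$
$$\Rightarrow G(\gamma_0) \leq G(s) \quad \text{(by } \spadesuit)$$
$$\text{i.e. } \gamma_1 \leq s$$
$$\Rightarrow G(\gamma_1) \leq G(s) \quad \text{(by } \spadesuit)$$
$$\text{i.e. } \gamma_2 \leq s$$
$$\vdots$$

Thus $\gamma_n \leq s$ for all n.

So if $s \geq 0$ and $G(s) = s$, then $\gamma = \lim_{n \to \infty} \gamma_n \leq s$. $\square$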
Example 1: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y ~ Binomial(2, 1/4). Find the probability that the process will
eventually die out.


Solution:


Let $G(s) = E(s^Y)$. The probability of ultimate extinction is $\gamma$, where $\gamma$ is the
smallest solution $\geq 0$ to the equation $G(s) = s$.


For Y ~ Binomial(n, p), the PGF is $G(s) = (ps + q)^n$ (Chapter 4).
So if Y ~ Binomial(2, 1/4) then $G(s) = \left( \frac{1}{4} s + \frac{3}{4} \right)^2$.


We need to solve $G(s) = s$:

$$G(s) = \left( \tfrac{1}{4} s + \tfrac{3}{4} \right)^2 = s$$
$$\tfrac{1}{16} s^2 + \tfrac{6}{16} s + \tfrac{9}{16} = s$$
$$\tfrac{1}{16} s^2 - \tfrac{10}{16} s + \tfrac{9}{16} = 0.$$

[Figure: graph with s on the horizontal axis (0.0 to 1.5).]

Trick: we know that $G(1) = 1$, so s = 1 has got to be a solution. Use this for a
quick factorization:

$$(s - 1)\left( \tfrac{1}{16} s - \tfrac{9}{16} \right) = 0.$$

Thus
$$s = 1, \quad \text{or} \quad \tfrac{1}{16} s = \tfrac{9}{16} \Rightarrow s = 9.$$

The smallest solution $\geq 0$ is s = 1.

Thus the probability of ultimate extinction is $\gamma = 1$.

Extinction is definite when the family size
distribution is Y ~ Binomial(2, 1/4).
Example 2: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y ~ Geometric(1/4). Find the probability that the process will
eventually die out.


Solution:


Let $G(s) = E(s^Y)$. Then P(ultimate extinction) = $\gamma$, where $\gamma$ is the smallest
solution $\geq 0$ to the equation $G(s) = s$.


For Y ~ Geometric(p), the PGF is $G(s) = \dfrac{p}{1 - qs}$ (Chapter 4).

So if Y ~ Geometric(1/4) then $G(s) = \dfrac{1/4}{1 - (3/4)s} = \dfrac{1}{4 - 3s}$.


We need to solve $G(s) = s$:

$$G(s) = \frac{1}{4 - 3s} = s$$
$$4s - 3s^2 = 1$$
$$3s^2 - 4s + 1 = 0.$$

Trick: know that s = 1 is a solution.

$$(s - 1)(3s - 1) = 0.$$

Thus
$$s = 1, \quad \text{or} \quad 3s = 1 \Rightarrow s = \tfrac{1}{3}.$$

The smallest solution $\geq 0$ is $s = \frac{1}{3}$.

Thus the probability of ultimate extinction is $\gamma = \frac{1}{3}$.

Extinction is possible but not definite when the
family size distribution is Y ~ Geometric(1/4).
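
The Lemma $\gamma_n = G(\gamma_{n-1})$ with $\gamma_0 = 0$ also suggests a numerical check of both examples: iterate G starting from 0 and watch where the sequence settles. The sketch below is illustrative only (the function names and the fixed iteration count are assumptions, not part of the course material).

```python
def extinction_probability(G, iterations=500):
    """Approximate gamma = lim gamma_n using gamma_n = G(gamma_{n-1}), gamma_0 = 0."""
    gamma = 0.0  # gamma_0 = P(Z_0 = 0) = 0 because Z_0 = 1
    for _ in range(iterations):
        gamma = G(gamma)
    return gamma

def binomial_pgf(n, p):
    """PGF of Binomial(n, p): G(s) = (p s + q)^n."""
    return lambda s: (p * s + (1 - p)) ** n

def geometric_pgf(p):
    """PGF of Geometric(p) on 0, 1, 2, ...: G(s) = p / (1 - q s)."""
    return lambda s: p / (1 - (1 - p) * s)

print(extinction_probability(binomial_pgf(2, 0.25)))  # Example 1
print(extinction_probability(geometric_pgf(0.25)))    # Example 2
```

The two printed values should agree with the answers 1 and 1/3 found by factorization. Because the iteration starts at 0 and G is increasing, it approaches the smallest non-negative solution rather than the solution s = 1 in Example 2.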
7.2 Conditions for ultimate extinction


It turns out that the probability of extinction depends crucially on the value of
$\mu$, the mean of the family size distribution Y.

Some values of $\mu$ guarantee that the branching process will die out with
probability 1. Other values guarantee that the probability of extinction will be
strictly less than 1. We will see below that the threshold value is $\mu = 1$.


If the mean number of offspring per individual $\mu$ is more than 1 (so on average,
individuals replace themselves plus a bit extra), then the branching process is
not guaranteed to die out, although it might do. However, if the mean number
of offspring per individual $\mu$ is 1 or less, the process is guaranteed to become
extinct (unless Y = 1 with probability 1). The result is not too surprising
for $\mu > 1$ or $\mu < 1$, but it is a little surprising that extinction is generally
guaranteed if $\mu = 1$.


Theorem 7.2: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y. Let $\mu = E(Y)$ be the mean of the family size distribution, and let $\gamma$
be the probability of ultimate extinction. Then

(i) If $\mu > 1$, then $\gamma < 1$: extinction is not guaranteed if $\mu > 1$.
(ii) If $\mu < 1$, then $\gamma = 1$: extinction is guaranteed if $\mu < 1$.
(iii) If $\mu = 1$, then $\gamma = 1$ unless the family size is always constant at Y = 1.


Lemma: Let $G(s)$ be the PGF of family size Y. Then $G(s)$ and $G'(s)$ are strictly
increasing for $0 < s < 1$, as long as Y can take values $\geq 2$.


Proof: $G(s) = E(s^Y) = \sum_{y=0}^{\infty} s^y P(Y = y)$.

So $G'(s) = \sum_{y=1}^{\infty} y s^{y-1} P(Y = y) > 0$ for $0 < s < 1$,
because all terms are $\geq 0$ and at least 1 term is $> 0$ (if $P(Y \geq 2) > 0$).

Similarly, $G''(s) = \sum_{y=2}^{\infty} y(y - 1) s^{y-2} P(Y = y) > 0$ for $0 < s < 1$.

So $G(s)$ and $G'(s)$ are strictly increasing for $0 < s < 1$. $\square$

Note: When $G''(s) > 0$ for $0 < s < 1$, the function G is said to be convex on that
interval.

Convex: $G''(s) > 0$. Concave: $G''(s) < 0$.

$G''(s) > 0$ means that the gradient of G is constantly increasing for $0 < s < 1$.


Proof of Theorem 7.2: This is usually done graphically. 


The graph of $G(s)$ satisfies the following conditions:


1. $G(s)$ is increasing and strictly convex (as long as Y can be $\geq 2$).

2. $G(0) = P(Y = 0) \geq 0$.

3. $G(1) = 1$.

4. $G'(1) = \mu$, so the slope of $G(s)$ at s = 1 gives the value $\mu$.

5. The extinction probability $\gamma$ is the smallest value $\geq 0$ for which $G(s) = s$.

[Figure: graph of t = G(s) against s, showing the line t = s (gradient = 1), the intercept P(Y = 0) on the t-axis, the gradient μ at s = 1, and the extinction probability γ.]
Case (i): $\mu > 1$

When $\mu > 1$, the curve $G(s)$ is forced beneath the line t = s at s = 1.
The curve $G(s)$ has to cross the line t = s again to meet the t-axis at $P(Y = 0)$.
Thus there must be a solution $\gamma < 1$ to the equation $G(s) = s$.

[Figure: curve t = G(s) and line t = s (gradient = 1) for μ > 1, with intercept P(Y = 0).]


Case (ii): $\mu < 1$

When $\mu < 1$, the curve $G(s)$ is forced above the line t = s for s < 1.
There is no possibility for the curve $G(s)$ to cross the line t = s again
before meeting the t-axis.

Thus there can be no solution < 1 to the equation $G(s) = s$, so $\gamma = 1$.

[Figure: curve t = G(s) and line t = s (gradient = 1) for μ < 1.]

The exception is where Y can take only values 0 and 1, so $G(s)$ is not strictly
convex (see Lemma). However, in that case $G(s) = p_0 + p_1 s$ is a straight line, giving
the same result $\gamma = 1$.


Case (iii): $\mu = 1$

When $\mu = 1$, the situation is the same as for $\mu < 1$.

The exception is where Y takes only the value 1. Then $G(s) = s$ for all $0 \leq s \leq 1$,
so the smallest solution $\geq 0$ is $\gamma = 0$.

[Figure: curve t = G(s) and line t = s (gradient = 1) for μ = 1, with intercept P(Y = 0).]

Thus extinction is guaranteed for $\mu = 1$,
unless Y = 1 with probability 1.


Example 1: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y ~ Binomial(2, 1/4), as in Section 7.1. Find the probability of
eventual extinction.


Solution:


Consider Y ~ Binomial(2, 1/4). The mean of Y is $\mu = 2 \times \frac{1}{4} = \frac{1}{2} < 1$. Thus, by
Theorem 7.2,
$$\gamma = P(\text{ultimate extinction}) = 1.$$

(The longer calculation in Section 7.1 was not necessary.)


Example 2: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y ~ Geometric(1/4), as in Section 7.1. Find the probability of
eventual extinction.


Solution:


Consider Y ~ Geometric(1/4). The mean of Y is $\mu = \frac{q}{p} = \frac{3/4}{1/4} = 3 > 1$. Thus, by
Theorem 7.2,
$$\gamma = P(\text{ultimate extinction}) < 1.$$

To find the value of $\gamma$, we still need to go through the calculation presented in
Section 7.1. (Answer: $\gamma = \frac{1}{3}$.)


Note: The mean $\mu$ of the offspring distribution Y is known as the criticality
parameter.


• If $\mu < 1$, extinction is definite ($\gamma = 1$). The process is called subcritical.
Note that $E(Z_n) = \mu^n \to 0$ as $n \to \infty$.

• If $\mu = 1$, extinction is definite unless Y = 1. The process is called critical.
Note that $E(Z_n) = \mu^n = 1$ for all n, even though extinction is definite.

• If $\mu > 1$, extinction is not definite ($\gamma < 1$). The process is called supercritical.
Note that $E(Z_n) = \mu^n \to \infty$ as $n \to \infty$.


BAD LUCK! Your population is SCHEDULED to become Extinct!


But how long have you got... ? 


7.3 Time to Extinction 


Suppose the population is doomed to extinction, or maybe it isn’t. Either way,
it is useful to know how long it will take for the population to become extinct.
This is the distribution of T, the number of generations before extinction. For
example, how long do we expect a disease epidemic like SARS to continue?
How long have we got to organize ourselves to save the kakapo
before they become extinct before our very eyes?


1. Extinction (by) time n


The branching process is extinct by time n if $Z_n = 0$.


Thus the probability that the process has become extinct by time n is:
$$P(Z_n = 0) = G_n(0) = \gamma_n.$$

Note: Recall that $G_n(s) = E(s^{Z_n}) = \underbrace{G\big(G(G(\ldots G(s) \ldots))\big)}_{n \text{ times}}$.


There is no guarantee that the PGF $G_n(s)$ or the value $G_n(0)$ can be calculated
easily. However, we can build up $G_n(0)$ in steps:

e.g. $G_2(0) = G(G(0))$; then $G_3(0) = G(G_2(0))$, or even $G_4(0) = G_2(G_2(0))$.
2. Extinction (at) time n


Let T be the exact time of extinction. That is, T = n if generation n is the
first generation with no individuals:

$$T = n \iff Z_n = 0 \text{ AND } Z_{n-1} > 0.$$

Now by the Partition Rule,

$$P(Z_n = 0 \cap Z_{n-1} > 0) + P(Z_n = 0 \cap Z_{n-1} = 0) = P(Z_n = 0). \qquad (\star)$$

But the event $\{Z_n = 0 \cap Z_{n-1} = 0\}$ is the event that the process is extinct by
generation n − 1 AND it is extinct by generation n. However, we know it will
always be extinct by generation n if it is extinct by generation n − 1, so the
$Z_n = 0$ part is redundant. So

$$P(Z_n = 0 \cap Z_{n-1} = 0) = P(Z_{n-1} = 0) = G_{n-1}(0).$$

Similarly,

$$P(Z_n = 0) = G_n(0).$$

So $(\star)$ gives:

$$P(T = n) = P(Z_n = 0 \cap Z_{n-1} > 0) = G_n(0) - G_{n-1}(0) = \gamma_n - \gamma_{n-1}.$$

This gives the distribution of T, the exact time at which extinction occurs.


Example: Binary splitting. Suppose that the family size distribution is

$$Y = \begin{cases} 0 & \text{with probability } q = 1 - p, \\ 1 & \text{with probability } p. \end{cases}$$

Find the distribution of the time to extinction.
Solution:
Consider

$$G(s) = E(s^Y) = q s^0 + p s^1 = q + ps.$$

$$G_2(s) = G(G(s)) = q + p(q + ps) = q(1 + p) + p^2 s.$$

$$G_3(s) = G(G_2(s)) = q + p(q + pq + p^2 s) = q(1 + p + p^2) + p^3 s.$$

$$\vdots$$

$$G_n(s) = q(1 + p + p^2 + \ldots + p^{n-1}) + p^n s.$$

Thus time to extinction, T, satisfies

$$P(T = n) = G_n(0) - G_{n-1}(0)$$
$$= q(1 + p + \ldots + p^{n-1}) - q(1 + p + \ldots + p^{n-2})$$
$$= q p^{n-1} \quad \text{for } n = 1, 2, \ldots$$

Thus
$$T - 1 \sim \text{Geometric}(q).$$

It follows that $E(T - 1) = \frac{p}{q}$, so

$$E(T) = 1 + \frac{p}{q} = \frac{1}{q}.$$

Note: The expected time to extinction, E(T), is:
• finite if $\mu < 1$;
• infinite if $\mu = 1$ (despite extinction being definite), if $\sigma^2$ is finite;
• infinite if $\mu > 1$ (because with positive probability, extinction never
happens).

(Results not proved here.)
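
The formula $P(T = n) = G_n(0) - G_{n-1}(0)$ can be evaluated numerically for any family size PGF by building up $G_n(0) = G(G_{n-1}(0))$ one step at a time. The sketch below does this for binary splitting and compares the result with $q p^{n-1}$; the function name and the value p = 0.6 are assumptions chosen only for illustration.

```python
def time_to_extinction_pmf(G, n_max):
    """Return {n: P(T = n)} for n = 1, ..., n_max via P(T = n) = G_n(0) - G_{n-1}(0)."""
    pmf = {}
    prev = 0.0  # G_0(0) = P(Z_0 = 0) = 0 because Z_0 = 1
    for n in range(1, n_max + 1):
        curr = G(prev)  # G_n(0) = G(G_{n-1}(0))
        pmf[n] = curr - prev
        prev = curr
    return pmf

p = 0.6  # illustrative value only
q = 1 - p
pmf = time_to_extinction_pmf(lambda s: q + p * s, 50)

for n in (1, 2, 3):
    print(n, pmf[n], q * p ** (n - 1))  # numerical value next to q p^(n-1)

# Truncated estimate of E(T), next to the exact value 1/q.
print(sum(n * prob for n, prob in pmf.items()), 1 / q)
```

The same function works for other family size distributions by passing a different PGF, although for supercritical processes the probabilities P(T = n) sum to γ < 1 rather than to 1.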
7.4 Case Study: Geometric Branching Processes


Recall that $G_n(s) = E(s^{Z_n}) = \underbrace{G\big(G(G(\ldots G(s) \ldots))\big)}_{n \text{ times}}$.


In general, it is not possible to find a closed-form expression for $G_n(s)$. We
achieved a closed-form $G_n(s)$ in the Binary Splitting example (page 144), but
binary splitting only allows family size Y to be 0 or 1, which is a very restrictive
model.


The only non-trivial family size distribution that allows us to find a closed-form
expression for $G_n(s)$ is the Geometric distribution.


When family size Y ~ Geometric(p), we can do the following:

• Derive a closed-form expression for $G_n(s)$, the PGF of $Z_n$.

• Find the probability distribution of the exact time of extinction, T:
not just the probability that extinction will occur at some unspecified time
($\gamma$).

• Find the full probability distribution of $Z_n$: probabilities $P(Z_n = 0)$,
$P(Z_n = 1)$, $P(Z_n = 2)$, ...


With Y ~ Geometric(p), we can therefore calculate just about every quantity
we might be interested in for the branching process.


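1. Closed form expression for $G_n(s)$


Theorem 7.4: Let $\{Z_0 = 1, Z_1, Z_2, \ldots\}$ be a branching process with family size
distribution Y ~ Geometric(p). The PGF of $Z_n$ is given by:

$$G_n(s) = E(s^{Z_n}) = \begin{cases} \dfrac{n - (n-1)s}{n + 1 - ns} & \text{if } p = q = 0.5, \\[10pt] \dfrac{(\mu^n - 1) - \mu(\mu^{n-1} - 1)s}{(\mu^{n+1} - 1) - \mu(\mu^n - 1)s} & \text{if } p \neq q, \text{ where } \mu = \dfrac{q}{p}. \end{cases}$$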
Proof (sketch):


The proofs for both p = q and p ≠ q proceed by mathematical induction. We
will give a sketch of the proof when p = q = 0.5. The proof for p ≠ q works in
the same way but is trickier.


Consider $p = q = \frac{1}{2}$. Then

$$G(s) = \frac{p}{1 - qs} = \frac{1/2}{1 - s/2} = \frac{1}{2 - s}.$$

Using the Branching Process Recursion Formula (Chapter 6),

$$G_2(s) = G(G(s)) = \frac{1}{2 - G(s)} = \frac{1}{2 - \frac{1}{2 - s}} = \frac{2 - s}{2(2 - s) - 1} = \frac{2 - s}{3 - 2s}.$$

The inductive hypothesis is that $G_n(s) = \dfrac{n - (n-1)s}{n + 1 - ns}$, and it holds for n = 1
and n = 2. Suppose it holds for n. Then

$$G_{n+1}(s) = G_n(G(s)) = \frac{n - (n-1)G(s)}{n + 1 - nG(s)} = \frac{n - (n-1)\left(\frac{1}{2-s}\right)}{n + 1 - n\left(\frac{1}{2-s}\right)}$$

$$= \frac{(2 - s)n - (n - 1)}{(2 - s)(n + 1) - n} = \frac{n + 1 - ns}{n + 2 - (n + 1)s}.$$

Therefore, if the hypothesis holds for n, it also holds for n + 1. Thus the
hypothesis is proved for all n. $\square$
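
As a sanity check on Theorem 7.4 (not a proof), the closed form can be compared with the n-fold iterate of $G(s) = p/(1 - qs)$ using exact rational arithmetic. This is a hedged sketch: the function names and the test values of p, n and s are arbitrary choices for illustration.

```python
from fractions import Fraction

def iterate_pgf(p, n, s):
    """n-fold iterate of G(s) = p / (1 - q s), evaluated at s."""
    q = 1 - p
    value = s
    for _ in range(n):
        value = p / (1 - q * value)
    return value

def closed_form(p, n, s):
    """Closed-form G_n(s) from Theorem 7.4, with mu = q/p."""
    q = 1 - p
    if p == q:
        return (n - (n - 1) * s) / (n + 1 - n * s)
    mu = q / p
    numerator = (mu ** n - 1) - mu * (mu ** (n - 1) - 1) * s
    denominator = (mu ** (n + 1) - 1) - mu * (mu ** n - 1) * s
    return numerator / denominator

s = Fraction(1, 3)  # arbitrary evaluation point in [0, 1]
for p in (Fraction(1, 2), Fraction(3, 10)):
    for n in range(1, 8):
        # Fractions make this an exact comparison, with no rounding error.
        assert iterate_pgf(p, n, s) == closed_form(p, n, s)
print("closed form agrees with the n-fold iterate")
```

Using `Fraction` rather than floating point means that agreement is exact for the values tried, which guards against sign or index slips when transcribing the formula.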
2. Exact time of extinction, T

Let Y ~ Geometric(p), and let T be the exact generation of extinction.
From Section 7.3,

$$P(T = n) = P(Z_n = 0) - P(Z_{n-1} = 0) = G_n(0) - G_{n-1}(0).$$

By using the closed-form expressions from Theorem 7.4 for $G_n(0)$ and $G_{n-1}(0)$, we can find
P(T = n) for any n.


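3. Whole distribution of $Z_n$

From Chapter 4, $P(Z_n = r) = \dfrac{1}{r!} G_n^{(r)}(0)$.

Now our closed-form expression for $G_n(s)$ has the same format regardless of
whether $\mu = 1$ (p = 0.5), or $\mu \neq 1$ (p ≠ 0.5):

$$G_n(s) = \frac{A - Bs}{C - Ds}.$$

(For example, when $\mu = 1$, we have A = D = n, B = n − 1, C = n + 1.) Thus:

$$P(Z_n = 0) = G_n(0) = \frac{A}{C}.$$

$$G_n'(s) = \frac{(C - Ds)(-B) + (A - Bs)D}{(C - Ds)^2} = \frac{AD - BC}{(C - Ds)^2}$$

$$\Rightarrow P(Z_n = 1) = \frac{1}{1!} G_n'(0) = \frac{AD - BC}{C^2}.$$

$$G_n''(s) = \frac{(-2)(-D)(AD - BC)}{(C - Ds)^3} = \frac{2D(AD - BC)}{(C - Ds)^3}$$

$$\Rightarrow P(Z_n = 2) = \frac{1}{2!} G_n''(0) = \left( \frac{AD - BC}{C^2} \right) \left( \frac{D}{C} \right).$$

$$\vdots$$

$$\Rightarrow P(Z_n = r) = \frac{1}{r!} G_n^{(r)}(0) = \left( \frac{AD - BC}{C^2} \right) \left( \frac{D}{C} \right)^{r-1} \quad \text{for } r = 1, 2, 3, \ldots$$

(Exercise)


This is very simple and powerful: we can substitute the values of A, B, C, and
D to find $P(Z_n = r)$ or $P(Z_n \leq r)$ for any r and n.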
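
A minimal sketch of this substitution is given below, assuming the coefficients A, B, C, D read off from Theorem 7.4 (with $\mu = q/p$); the function names are illustrative. It evaluates $P(Z_{10} = 0)$ for the two simulated cases of Section 6.4, so the output can be set beside the empirical extinction proportions 0.9108 (p = 0.5) and 0.436 (p = 0.3).

```python
def coefficients(p, n):
    """A, B, C, D such that G_n(s) = (A - B s) / (C - D s) for Y ~ Geometric(p)."""
    q = 1 - p
    if p == q:
        return n, n - 1, n + 1, n
    mu = q / p
    return (mu ** n - 1,
            mu * (mu ** (n - 1) - 1),
            mu ** (n + 1) - 1,
            mu * (mu ** n - 1))

def prob_zn(p, n, r):
    """P(Z_n = r): A/C for r = 0, else ((AD - BC)/C^2) (D/C)^(r-1)."""
    A, B, C, D = coefficients(p, n)
    if r == 0:
        return A / C
    return (A * D - B * C) / C ** 2 * (D / C) ** (r - 1)

print(prob_zn(0.5, 10, 0))  # compare with the simulated proportion 0.9108
print(prob_zn(0.3, 10, 0))  # compare with the simulated proportion 0.436
print(sum(prob_zn(0.5, 10, r) for r in range(3000)))  # should be very close to 1
```

Cumulative probabilities $P(Z_n \leq r)$ follow by summing `prob_zn(p, n, k)` over k = 0, ..., r.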


Note: A Java applet that simulates branching processes can be found at:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/bookapplets/chapter10/Branch/Branch.html
Chapter 8: Markov Chains


8.1 Introduction 


So far, we have examined several stochastic processes using
transition diagrams and First-Step Analysis.

The processes can be written as $\{X_0, X_1, X_2, \ldots\}$,
where $X_t$ is the state at time t.

On the transition diagram, $X_t$ corresponds to
which box we are in at step t.

[Photo: A. A. Markov, 1856-1922]

In the Gambler’s Ruin (Section 2.7), $X_t$ is the amount of money the gambler
possesses after toss t. In the model for gene spread (Section 3.7), $X_t$ is the
number of animals possessing the harmful allele A in generation t.

The processes that we have looked at via the transition diagram have a crucial
property in common:

$X_{t+1}$ depends only on $X_t$.
It does not depend upon $X_0, X_1, \ldots, X_{t-1}$.


Processes like this are called Markov Chains.


Example: Random Walk (see Chapter 4)

[Figure: random walk trajectory up to time t, annotated “none of these steps matter for time t+1”.]

In a Markov chain, the future depends only upon the present:
NOT upon the past.
Meet... the Markov fleas!

[Cartoon: the Markov fleas.]

The text-book image of a Markov chain has a flea hopping about at
random on the vertices of the transition diagram,
according to the probabilities shown.


The transition diagram above shows a system with 7 possible states: 
state space S = {1,2,3,4,5,6, 7}. 
Questions of interest 


• Starting from state 1, what is the probability of ever reaching state 7?

• Starting from state 2, what is the expected time taken to reach state 4?

• Starting from state 2, what is the long-run proportion of time spent in
state 3?

• Starting from state 1, what is the probability of being in state 2 at time
t? Does the probability converge as $t \to \infty$, and if so, to what?


We have been answering questions like the first two using first-step analysis 
since the start of STATS 325. In this chapter we develop a unified approach 
to all these questions using the matrix of transition probabilities, called the 
transition matrix. 


8.2 Definitions 


The Markov chain is the process $X_0, X_1, X_2, \ldots$.


Definition: The state of a Markov chain at time t is the value of $X_t$.
For example, if $X_t = 6$, we say the process is in state 6 at time t.

Definition: The state space of a Markov chain, S, is the set of values that each
$X_t$ can take. For example, S = {1, 2, 3, 4, 5, 6, 7}.
Let S have size N (possibly infinite).


Definition: A trajectory of a Markov chain is a particular set of values for
$X_0, X_1, X_2, \ldots$.


For example, if $X_0 = 1$, $X_1 = 5$, and $X_2 = 6$, then the trajectory up to time
t = 2 is 1, 5, 6.


More generally, if we refer to the trajectory $s_0, s_1, s_2, s_3, \ldots$, we mean that
$X_0 = s_0, X_1 = s_1, X_2 = s_2, X_3 = s_3, \ldots$


‘Trajectory’ is just a word meaning ‘path’.


Markov Property


The basic property of a Markov chain is that only the most recent point in the
trajectory affects what happens next.


This is called the Markov Property.
It means that $X_{t+1}$ depends upon $X_t$, but it does not depend upon $X_{t-1}, \ldots, X_1, X_0$.
We formulate the Markov Property in mathematical notation as follows:

$$P(X_{t+1} = s \mid X_t = s_t, X_{t-1} = s_{t-1}, \ldots, X_0 = s_0) = P(X_{t+1} = s \mid X_t = s_t),$$

for all t = 1, 2, 3, ... and for all states $s_0, s_1, \ldots, s_t, s$.

Explanation: in

$$P(X_{t+1} = s \mid X_t = s_t, X_{t-1} = s_{t-1}, X_{t-2} = s_{t-2}, \ldots, X_1 = s_1, X_0 = s_0),$$

the distribution of $X_{t+1}$ depends on $X_t$, but whatever happened before time t
(the conditions on $X_{t-1}, \ldots, X_0$) doesn’t matter.


Definition: Let $\{X_0, X_1, X_2, \ldots\}$ be a sequence of discrete random variables. Then
$\{X_0, X_1, X_2, \ldots\}$ is a Markov chain if it satisfies the Markov property:

$$P(X_{t+1} = s \mid X_t = s_t, \ldots, X_0 = s_0) = P(X_{t+1} = s \mid X_t = s_t),$$

for all t = 1, 2, 3, ... and for all states $s_0, s_1, \ldots, s_t, s$.


8.3 The Transition Matrix 


We have seen many examples of transition diagrams to describe Markov 
chains. The transition diagram is so-called because it shows the transitions 
between different states. 


[Transition diagram: two states, Hot and Cold. Hot to Hot: 0.2; Hot to Cold: 0.8; Cold to Hot: 0.6; Cold to Cold: 0.4.]

We can also summarize the probabilities in a matrix:

                  X_{t+1}
                Hot    Cold
X_t   Hot    (  0.2    0.8  )
      Cold   (  0.6    0.4  )
The matrix describing the Markov chain is called the transition matrix.
It is the most important tool for analysing Markov chains.

[Schematic of the transition matrix: the rows list all states (for $X_t$); the columns list all states (for $X_{t+1}$); insert all probabilities $p_{ij}$; each row adds to 1.]

The transition matrix is usually given the symbol $P = (p_{ij})$.


In the transition matrix P:

• the ROWS represent NOW, or FROM ($X_t$);
• the COLUMNS represent NEXT, or TO ($X_{t+1}$);
• entry (i, j) is the CONDITIONAL probability that NEXT = j, given that
NOW = i: the probability of going FROM state i TO state j.


Notes: 1. The transition matrix P must list all possible states in the state space S.

2. P is a square matrix (N × N), because $X_{t+1}$ and $X_t$ both take values in the
same state space S (of size N).

3. The rows of P should each sum to 1:

$$\sum_{j=1}^{N} p_{ij} = \sum_{j=1}^{N} P(X_{t+1} = j \mid X_t = i) = \sum_{j=1}^{N} P_{\{X_t = i\}}(X_{t+1} = j) = 1.$$

This simply states that $X_{t+1}$ must take one of the listed values.

4. The columns of P do not in general sum to 1.
Definition: Let $\{X_0, X_1, X_2, \ldots\}$ be a Markov chain with state space $S$, where $S$ has size $N$ (possibly infinite). The transition probabilities of the Markov chain are

$$p_{ij} = P(X_{t+1} = j \mid X_t = i) \quad \text{for } i, j \in S, \; t = 0, 1, 2, \ldots$$

Definition: The transition matrix of the Markov chain is $P = (p_{ij})$.
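
The properties listed in the Notes can be checked mechanically. The following is a minimal Python sketch, assuming the Hot/Cold chain of Section 8.3 stored as a list of rows; the helper name `is_transition_matrix` is made up for this illustration and is not part of the course material.

```python
from fractions import Fraction as F

# Hot/Cold chain from Section 8.3.
# Rows are NOW (X_t), columns are NEXT (X_{t+1}); order of states: Hot, Cold.
P = [
    [F(2, 10), F(8, 10)],  # from Hot
    [F(6, 10), F(4, 10)],  # from Cold
]

def is_transition_matrix(matrix):
    """Check that the matrix is square, non-negative, and has rows summing to 1."""
    n = len(matrix)
    return all(
        len(row) == n and all(p >= 0 for p in row) and sum(row) == 1
        for row in matrix
    )

row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(P))) for j in range(len(P))]

print(is_transition_matrix(P))
print(row_sums)  # every row sums to 1
print(col_sums)  # the columns need not sum to 1
```

Exact fractions are used so that the row-sum test is not affected by floating-point rounding.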


8.4 Example: setting up the transition matrix 


We can create a transition matrix for any of the transition diagrams we have 
seen in problems throughout the course. For example, check the matrix below. 
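Example: Tennis game at Deuce.

[Transition diagram: from DEUCE (D), Venus moves to VENUS AHEAD (A) with probability $p$ or to VENUS BEHIND (B) with probability $q$. From A she moves to VENUS WINS (W) with probability $p$ or back to D with probability $q$. From B she moves back to D with probability $p$ or to VENUS LOSES (L) with probability $q$.]

$$P = \begin{array}{c|ccccc} & D & A & B & W & L \\ \hline D & 0 & p & q & 0 & 0 \\ A & q & 0 & 0 & p & 0 \\ B & p & 0 & 0 & 0 & q \\ W & 0 & 0 & 0 & 1 & 0 \\ L & 0 & 0 & 0 & 0 & 1 \end{array}$$
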
8.5 Matrix Revision

Notation

Let $A$ be an $N \times N$ matrix.
We write $A = (a_{ij})$, i.e. $A$ comprises elements $a_{ij}$, where $a_{ij}$ sits in row $i$ and column $j$.

The $(i, j)$ element of $A$ is written both as $a_{ij}$ and $(A)_{ij}$: e.g. for matrix $A^2$ we might write $(A^2)_{ij}$.


Matrix multiplication

Let $A = (a_{ij})$ and $B = (b_{ij})$ be $N \times N$ matrices.

The product matrix is $A \times B = AB$, with elements $(AB)_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj}$.


Summation notation for a matrix squared

Let $A$ be an $N \times N$ matrix. Then

$$(A^2)_{ij} = \sum_{k=1}^{N} (A)_{ik} (A)_{kj} = \sum_{k=1}^{N} a_{ik} a_{kj}.$$


Pre-multiplication of a matrix by a vector

Let $A$ be an $N \times N$ matrix, and let $\pi$ be an $N \times 1$ column vector: $\pi = (\pi_1, \ldots, \pi_N)^T$.

We can pre-multiply $A$ by $\pi^T$ to get a $1 \times N$ row vector, $\pi^T A = \left( (\pi^T A)_1, \ldots, (\pi^T A)_N \right)$, with elements

$$(\pi^T A)_j = \sum_{i=1}^{N} \pi_i a_{ij}.$$


8.6 The t-step transition probabilities 


Let $\{X_0, X_1, X_2, \ldots\}$ be a Markov chain with state space $S = \{1, 2, \ldots, N\}$.

Recall that the elements of the transition matrix $P$ are defined as:

$$(P)_{ij} = p_{ij} = P(X_1 = j \mid X_0 = i) = P(X_{n+1} = j \mid X_n = i) \quad \text{for any } n.$$

$p_{ij}$ is the probability of making a transition FROM state $i$ TO state $j$ in a SINGLE step.

Question: what is the probability of making a transition from state $i$ to state $j$ over two steps? I.e. what is $P(X_2 = j \mid X_0 = i)$?
We are seeking $P(X_2 = j \mid X_0 = i)$. Use the Partition Theorem:

$$\begin{aligned}
P(X_2 = j \mid X_0 = i) &= P_i(X_2 = j) && \text{(notation of Ch 2)} \\
&= \sum_{k=1}^{N} P_i(X_2 = j \mid X_1 = k)\, P_i(X_1 = k) && \text{(Partition Thm)} \\
&= \sum_{k=1}^{N} P(X_2 = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) \\
&= \sum_{k=1}^{N} P(X_2 = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i) && \text{(Markov Property)} \\
&= \sum_{k=1}^{N} p_{kj}\, p_{ik} && \text{(by definitions)} \\
&= \sum_{k=1}^{N} p_{ik}\, p_{kj} && \text{(rearranging)} \\
&= (P^2)_{ij}. && \text{(see Matrix Revision)}
\end{aligned}$$

The two-step transition probabilities are therefore given by the matrix $P^2$:

$$P(X_2 = j \mid X_0 = i) = P(X_{n+2} = j \mid X_n = i) = (P^2)_{ij} \quad \text{for any } n.$$


3-step transitions: We can find $P(X_3 = j \mid X_0 = i)$ similarly, but conditioning on the state at time 2:

$$\begin{aligned}
P(X_3 = j \mid X_0 = i) &= \sum_{k=1}^{N} P(X_3 = j \mid X_2 = k)\, P(X_2 = k \mid X_0 = i) \\
&= \sum_{k=1}^{N} p_{kj}\, (P^2)_{ik} \\
&= (P^3)_{ij}.
\end{aligned}$$

The three-step transition probabilities are therefore given by the matrix $P^3$:

$$P(X_3 = j \mid X_0 = i) = P(X_{n+3} = j \mid X_n = i) = (P^3)_{ij} \quad \text{for any } n.$$


General case: t-step transitions

The above working extends to show that the $t$-step transition probabilities are given by the matrix $P^t$ for any $t$:

$$P(X_t = j \mid X_0 = i) = P(X_{n+t} = j \mid X_n = i) = (P^t)_{ij} \quad \text{for any } n.$$

We have proved the following Theorem.

Theorem 8.6: Let $\{X_0, X_1, X_2, \ldots\}$ be a Markov chain with $N \times N$ transition matrix $P$. Then the $t$-step transition probabilities are given by the matrix $P^t$. That is,

$$P(X_t = j \mid X_0 = i) = (P^t)_{ij}.$$

It also follows that

$$P(X_{n+t} = j \mid X_n = i) = (P^t)_{ij} \quad \text{for any } n. \qquad \square$$
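
Theorem 8.6 can be illustrated numerically. The sketch below is not part of the original notes: it assumes the Hot/Cold chain of Section 8.3 with states numbered 0 (Hot) and 1 (Cold), and the helper names `mat_mul` and `mat_pow` are chosen here for illustration only.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Matrix product: (AB)_ij = sum over k of a_ik * b_kj."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    """P^t by repeated multiplication; P^0 is the identity matrix."""
    n = len(P)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

# Hot/Cold chain (state 0 = Hot, state 1 = Cold).
P = [[F(1, 5), F(4, 5)], [F(3, 5), F(2, 5)]]

P2 = mat_pow(P, 2)
# Two-step probability Hot -> Cold, written out as the Partition Theorem sum.
hot_to_cold = sum(P[0][k] * P[k][1] for k in range(2))

print(P2)
print(hot_to_cold == P2[0][1])
```

The direct sum over the intermediate state $k$ and the $(1, 2)$ entry of $P^2$ agree, which is the content of the two-step derivation above.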


8.7 Distribution of $X_t$

Let $\{X_0, X_1, X_2, \ldots\}$ be a Markov chain with state space $S = \{1, 2, \ldots, N\}$.
Now each $X_t$ is a random variable, so it has a probability distribution.
We can write the probability distribution of $X_t$ as an $N \times 1$ vector.

For example, consider $X_0$. Let $\pi$ be an $N \times 1$ vector denoting the probability distribution of $X_0$:

$$\pi = \begin{pmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_N \end{pmatrix} = \begin{pmatrix} P(X_0 = 1) \\ P(X_0 = 2) \\ \vdots \\ P(X_0 = N) \end{pmatrix}.$$

In the flea model, this corresponds to the flea choosing at random which vertex it starts off from at time 0, such that

P(flea chooses vertex $i$ to start) $= \pi_i$.

Notation: we will write $X_0 \sim \pi^T$ to denote that the row vector of probabilities is given by the row vector $\pi^T$.


Probability distribution of $X_1$

Use the Partition Rule, conditioning on $X_0$:

$$\begin{aligned}
P(X_1 = j) &= \sum_{i=1}^{N} P(X_1 = j \mid X_0 = i)\, P(X_0 = i) \\
&= \sum_{i=1}^{N} p_{ij}\, \pi_i && \text{(by definitions)} \\
&= \sum_{i=1}^{N} \pi_i\, p_{ij} \\
&= (\pi^T P)_j && \text{(pre-multiplication by a vector from Section 8.5).}
\end{aligned}$$

This shows that $P(X_1 = j) = (\pi^T P)_j$ for all $j$.
The row vector $\pi^T P$ is therefore the probability distribution of $X_1$:

$X_0 \sim \pi^T$

$X_1 \sim \pi^T P$.


Probability distribution of $X_2$

Using the Partition Rule as before, conditioning again on $X_0$:

$$P(X_2 = j) = \sum_{i=1}^{N} P(X_2 = j \mid X_0 = i)\, P(X_0 = i) = \sum_{i=1}^{N} (P^2)_{ij}\, \pi_i = (\pi^T P^2)_j.$$

The row vector $\pi^T P^2$ is therefore the probability distribution of $X_2$: if $X_0 \sim \pi^T$, then $X_2 \sim \pi^T P^2$.


These results are summarized in the following Theorem. 


Theorem 8.7: Let $\{X_0, X_1, X_2, \ldots\}$ be a Markov chain with $N \times N$ transition matrix $P$. If the probability distribution of $X_0$ is given by the $1 \times N$ row vector $\pi^T$, then the probability distribution of $X_t$ is given by the $1 \times N$ row vector $\pi^T P^t$. That is,

$$X_0 \sim \pi^T \;\Rightarrow\; X_t \sim \pi^T P^t.$$

Note: The distribution of $X_t$ is $X_t \sim \pi^T P^t$.
The distribution of $X_{t+1}$ is $X_{t+1} \sim \pi^T P^{t+1}$.
Taking one step in the Markov chain corresponds to multiplying by $P$ on the right.

Note: The $t$-step transition matrix is $P^t$ (Theorem 8.6).
The $(t + 1)$-step transition matrix is $P^{t+1}$.
Again, taking one step in the Markov chain corresponds to multiplying by $P$ on the right.
8.8 Trajectory Probability

Recall that a trajectory is a sequence of values for $X_0, X_1, \ldots, X_t$.

Because of the Markov Property, we can find the probability of any trajectory by multiplying together the starting probability and all subsequent single-step probabilities.


Example: Let Xo ~ (3, 0, ‘, 0,0,0,0). What is the probability of the trajectory 
1, 2, 3, 2, 3, 4?


$P(1, 2, 3, 2, 3, 4) = P(X_0 = 1) \times p_{12} \times p_{23} \times p_{32} \times p_{23} \times p_{34}$


ied 
x 2 xX x1xex1lx; 


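Proof in formal notation using the Markov Property:

Let $X_0 \sim \pi^T$. We wish to find the probability of the trajectory $s_0, s_1, s_2, \ldots, s_t$.

$$\begin{aligned}
& P(X_0 = s_0, X_1 = s_1, \ldots, X_t = s_t) \\
&= P(X_t = s_t \mid X_{t-1} = s_{t-1}, \ldots, X_0 = s_0) \times P(X_{t-1} = s_{t-1}, \ldots, X_0 = s_0) \\
&= P(X_t = s_t \mid X_{t-1} = s_{t-1}) \times P(X_{t-1} = s_{t-1}, \ldots, X_0 = s_0) \qquad \text{(Markov Property)} \\
&= p_{s_{t-1}, s_t} \times P(X_{t-1} = s_{t-1} \mid X_{t-2} = s_{t-2}, \ldots, X_0 = s_0) \times P(X_{t-2} = s_{t-2}, \ldots, X_0 = s_0) \\
&\;\;\vdots \\
&= p_{s_{t-1}, s_t} \times p_{s_{t-2}, s_{t-1}} \times \cdots \times p_{s_0, s_1} \times P(X_0 = s_0) \\
&= p_{s_{t-1}, s_t} \times p_{s_{t-2}, s_{t-1}} \times \cdots \times p_{s_0, s_1} \times \pi_{s_0}.
\end{aligned}$$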
8.9 Worked Example: distribution of $X_t$ and trajectory probabilities

Purpose-flea zooms around the vertices of the transition diagram opposite. Let $X_t$ be Purpose-flea's state at time $t$ ($t = 0, 1, \ldots$).

(a) Find the transition matrix, $P$.

Answer: $P = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.4 & 0 & 0.6 \\ 0 & 0.8 & 0.2 \end{pmatrix}$


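(b) Find $P(X_2 = 3 \mid X_0 = 1)$.

$$\begin{aligned}
P(X_2 = 3 \mid X_0 = 1) = (P^2)_{13} &= \begin{pmatrix} 0.6 & 0.2 & 0.2 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.6 \\ 0.2 \end{pmatrix} \\
&= 0.6 \times 0.2 + 0.2 \times 0.6 + 0.2 \times 0.2 \\
&= 0.28.
\end{aligned}$$

Note: we only need one element of the matrix $P^2$, so don't lose exam time by finding the whole matrix.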


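(c) Suppose that Purpose-flea is equally likely to start on any vertex at time 0. Find the probability distribution of $X_1$.

The distribution of $X_0$ is $\pi^T = \left( \tfrac13, \tfrac13, \tfrac13 \right)$. We need $X_1 \sim \pi^T P$.

$$\pi^T P = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix} \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.4 & 0 & 0.6 \\ 0 & 0.8 & 0.2 \end{pmatrix} = \begin{pmatrix} \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}.$$

Thus $X_1 \sim \left( \tfrac13, \tfrac13, \tfrac13 \right)$ and therefore $X_1$ is also equally likely to be 1, 2, or 3.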


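(d) Suppose that Purpose-flea begins at vertex 1 at time 0. Find the probability distribution of $X_2$.

The distribution of $X_0$ is now $\pi^T = (1, 0, 0)$. We need $X_2 \sim \pi^T P^2$.

$$\begin{aligned}
\pi^T P^2 &= \begin{pmatrix} 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.4 & 0 & 0.6 \\ 0 & 0.8 & 0.2 \end{pmatrix} \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.4 & 0 & 0.6 \\ 0 & 0.8 & 0.2 \end{pmatrix} \\
&= \begin{pmatrix} 0.6 & 0.2 & 0.2 \end{pmatrix} \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.4 & 0 & 0.6 \\ 0 & 0.8 & 0.2 \end{pmatrix} \\
&= \begin{pmatrix} 0.44 & 0.28 & 0.28 \end{pmatrix}.
\end{aligned}$$

Thus $P(X_2 = 1) = 0.44$, $P(X_2 = 2) = 0.28$, $P(X_2 = 3) = 0.28$.

Note that it is quickest to multiply the vector by the matrix first: we don't need to compute $P^2$ in entirety.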


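(e) Suppose that Purpose-flea is equally likely to start on any vertex at time 0. Find the probability of obtaining the trajectory (3, 2, 1, 1, 3).

$$\begin{aligned}
P(3, 2, 1, 1, 3) &= P(X_0 = 3) \times p_{32} \times p_{21} \times p_{11} \times p_{13} && \text{(Section 8.8)} \\
&= \tfrac13 \times 0.8 \times 0.4 \times 0.6 \times 0.2 \\
&= 0.0128.
\end{aligned}$$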
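
The answers to parts (b) to (e) can be checked with a few lines of code. This is a sketch in Python rather than part of the original notes; the function names `step`, `distribution_at` and `trajectory_probability` are assumptions made for this illustration, and the matrix is the one given in part (a).

```python
from fractions import Fraction as F

# Transition matrix from part (a); row i-1 holds the probabilities out of vertex i.
P = [
    [F(6, 10), F(2, 10), F(2, 10)],
    [F(4, 10), F(0), F(6, 10)],
    [F(0), F(8, 10), F(2, 10)],
]

def step(dist, matrix):
    """One step of the chain: the row vector dist multiplied by matrix on the right."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def distribution_at(dist0, matrix, t):
    """Distribution of X_t given X_0 ~ dist0, i.e. dist0 times matrix^t."""
    dist = list(dist0)
    for _ in range(t):
        dist = step(dist, matrix)
    return dist

def trajectory_probability(dist0, matrix, path):
    """P(X_0 = path[0], ..., X_t = path[t]); states are numbered from 1."""
    prob = dist0[path[0] - 1]
    for a, b in zip(path, path[1:]):
        prob *= matrix[a - 1][b - 1]
    return prob

uniform = [F(1, 3)] * 3
start_at_1 = [F(1), F(0), F(0)]

part_b = distribution_at(start_at_1, P, 2)[2]   # (P^2)_{13}
part_c = distribution_at(uniform, P, 1)
part_d = distribution_at(start_at_1, P, 2)
part_e = trajectory_probability(uniform, P, [3, 2, 1, 1, 3])

print(float(part_b))
print(part_c)
print([float(x) for x in part_d])
print(float(part_e))
```

As in part (d), the vector is multiplied by the matrix one step at a time, so $P^2$ is never formed in full.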
8.10 Class Structure

The state space of a Markov chain can be partitioned into a set of non-overlapping communicating classes.

States $i$ and $j$ are in the same communicating class if there is some way of getting from state $i$ to state $j$, AND there is some way of getting from state $j$ to state $i$. It needn't be possible to get between $i$ and $j$ in a single step, but it must be possible over some number of steps to travel between them both ways.

We write $i \leftrightarrow j$.

Definition: Consider a Markov chain with state space $S$ and transition matrix $P$, and consider states $i, j \in S$. Then state $i$ communicates with state $j$ if:
1. there exists some $t$ such that $(P^t)_{ij} > 0$, AND
2. there exists some $u$ such that $(P^u)_{ji} > 0$.

Mathematically, it is easy to show that the communicating relation $\leftrightarrow$ is an equivalence relation, which means that it partitions the state space $S$ into non-overlapping equivalence classes.

Definition: States $i$ and $j$ are in the same communicating class if $i \leftrightarrow j$: i.e. if each state is accessible from the other.


Every state is a member of exactly one communicating class. 


Example: Find the communicating classes associated with the transition diagram shown.

[Transition diagram not reproduced.]

Solution:

$\{1, 2, 3\}, \quad \{4, 5\}$.

State 2 leads to state 4, but state 4 does not lead back to state 2, so they are in different communicating classes.


Definition: A communicating class of states is closed if it is not possible to leave that class.

That is, the communicating class $C$ is closed if $p_{ij} = 0$ whenever $i \in C$ and $j \notin C$.


Example: In the transition diagram above:
• Class $\{1, 2, 3\}$ is not closed: it is possible to escape to class $\{4, 5\}$.
• Class $\{4, 5\}$ is closed: it is not possible to escape.


Definition: A state $i$ is said to be absorbing, if the set $\{i\}$ is a closed class.


Definition: A Markov chain or transition matrix $P$ is said to be irreducible if $i \leftrightarrow j$ for all $i, j \in S$. That is, the chain is irreducible if the state space $S$ is a single communicating class.
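
The definitions of this section translate directly into a reachability computation. The sketch below is illustrative only: the 5-state matrix is a hypothetical chain invented here so that its classes come out as $\{1, 2, 3\}$ and $\{4, 5\}$ (it is not the diagram from the notes), and the function names are assumptions.

```python
def reachable_from(P, start):
    """States j with (P^t)_{start,j} > 0 for some t >= 0, found by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(P[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_classes(P):
    """Partition the states 0..n-1 into classes of mutually accessible states."""
    n = len(P)
    reach = [reachable_from(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = sorted(j for j in reach[i] if i in reach[j])
            assigned.update(cls)
            classes.append(cls)
    return classes

def is_closed(P, cls):
    """A class is closed if p_ij = 0 whenever i is in the class and j is not."""
    return all(P[i][j] == 0 for i in cls for j in range(len(P)) if j not in cls)

# Hypothetical chain; index k stands for state k + 1.
P = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.1, 0.7, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

classes = communicating_classes(P)
for cls in classes:
    print([k + 1 for k in cls], "closed" if is_closed(P, cls) else "not closed")
print("irreducible:", len(classes) == 1)
```

The chain is irreducible exactly when the list of classes has a single entry.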


8.11 Hitting Probabilities 


We have been calculating hitting 
probabilities for Markov chains 
since Chapter 2, using First-Step 
Analysis. The hitting probability 
describes the probability that the 
Markov chain will ever reach some 
state or set of states. 


In this section we show how hitting 
probabilities can be written in a 
single vector. We also see a general 
formula for calculating the hitting 
probabilities. In general it is easier 
to continue using our own common 
sense, but occasionally the formula 
becomes more necessary. 


THE UNIVERSITY 
OF AUCKLAND 
NEW ZEALAND 
Te Whare Wananga o Tamaki Makaurau


Vector of hitting probabilities 


Let $A$ be some subset of the state space $S$. ($A$ need not be a communicating class: it can be any subset required, including the subset consisting of a single state: e.g. $A = \{4\}$.)

The hitting probability from state $i$ to set $A$ is the probability of ever reaching the set $A$, starting from initial state $i$. We write this probability as $h_{iA}$. Thus

$$h_{iA} = P(X_t \in A \text{ for some } t \ge 0 \mid X_0 = i).$$


Example: Let set $A = \{1, 3\}$ as shown.

[Transition diagram not reproduced: five states, in which state 2 moves into $A$ with probability 0.3 and to state 4 with probability 0.7, and $\{4, 5\}$ is a closed class.]

The hitting probability for set $A$ is:

• 1 starting from states 1 or 3
(We are starting in set $A$, so we hit it immediately);

• 0 starting from states 4 or 5
(The set $\{4, 5\}$ is a closed class, so we can never escape out to set $A$);

• 0.3 starting from state 2
(We could hit $A$ at the first step (probability 0.3), but otherwise we move to state 4 and get stuck in the closed class $\{4, 5\}$ (probability 0.7).)


We can summarize all the information from the example above in a vector of hitting probabilities:

$$h_A = \begin{pmatrix} h_{1A} \\ h_{2A} \\ h_{3A} \\ h_{4A} \\ h_{5A} \end{pmatrix} = \begin{pmatrix} 1 \\ 0.3 \\ 1 \\ 0 \\ 0 \end{pmatrix}.$$

Note: When $A$ is a closed class, the hitting probability $h_{iA}$ is called the absorption probability.

In general, if there are $N$ possible states, the vector of hitting probabilities is

$$h_A = \begin{pmatrix} h_{1A} \\ h_{2A} \\ \vdots \\ h_{NA} \end{pmatrix} = \begin{pmatrix} P(\text{hit } A \text{ starting from state } 1) \\ P(\text{hit } A \text{ starting from state } 2) \\ \vdots \\ P(\text{hit } A \text{ starting from state } N) \end{pmatrix}.$$


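Example: finding the hitting probability vector using First-Step Analysis

Suppose $\{X_t : t \ge 0\}$ has the following transition diagram:

[Transition diagram: states 1, 2, 3, 4 in a line. States 1 and 4 are absorbing (each returns to itself with probability 1). From state 2 the chain moves to state 1 or state 3, each with probability 1/2; from state 3 it moves to state 2 or state 4, each with probability 1/2.]

Find the vector of hitting probabilities for state 4.

Solution:

Let $h_{i4} = P(\text{hit state 4, starting from state } i)$. Clearly,

$h_{14} = 0$
$h_{44} = 1$

Using first-step analysis, we also have:

$h_{24} = \tfrac12 h_{34} + \tfrac12 \times 0$
$h_{34} = \tfrac12 + \tfrac12 h_{24}$

Solving,

$h_{34} = \tfrac12 + \tfrac12 \left( \tfrac12 h_{34} \right) \;\Rightarrow\; h_{34} = \tfrac23$. So also, $h_{24} = \tfrac12 h_{34} = \tfrac13$.

So the vector of hitting probabilities is

$$h_A = \left( 0, \tfrac13, \tfrac23, 1 \right).$$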
Formula for hitting probabilities

In the previous example, we used our common sense to state that $h_{14} = 0$. While this is easy for a human brain, it is harder to explain a general rule that would describe this 'common sense' mathematically, or that could be used to write computer code that will work for all problems.

Although it is usually best to continue to use common sense when solving problems, this section provides a general formula that will always work to find a vector of hitting probabilities $h_A$.

Theorem 8.11: The vector of hitting probabilities $h_A = (h_{iA} : i \in S)$ is the minimal non-negative solution to the following equations:

$$h_{iA} = \begin{cases} 1 & \text{for } i \in A, \\ \sum_{j \in S} p_{ij} h_{jA} & \text{for } i \notin A. \end{cases}$$

The 'minimal non-negative solution' means that:

1. the values $\{h_{iA}\}$ collectively satisfy the equations above;

2. each value $h_{iA}$ is $\ge 0$ (non-negative);

3. given any other non-negative solution to the equations above, say $\{g_{iA}\}$ where $g_{iA} \ge 0$ for all $i$, then $h_{iA} \le g_{iA}$ for all $i$ (minimal solution).


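Example: How would this formula be used to substitute for 'common sense' in the previous example?

[Transition diagram as before: states 1 and 4 absorbing; states 2 and 3 each move one step left or right with probability 1/2.]

The equations give:

$$h_{i4} = \begin{cases} 1 & \text{if } i = 4, \\ \sum_{j \in S} p_{ij} h_{j4} & \text{if } i \ne 4. \end{cases}$$

Thus,

$h_{44} = 1$
$h_{14} = h_{14}$ (unspecified! Could be anything!)
$h_{24} = \tfrac12 h_{14} + \tfrac12 h_{34}$
$h_{34} = \tfrac12 h_{24} + \tfrac12 h_{44} = \tfrac12 h_{24} + \tfrac12$

Because $h_{14}$ could be anything, we have to use the minimal non-negative value, which is $h_{14} = 0$.
(Need to check $h_{14} = 0$ does not force $h_{i4} < 0$ for any other $i$: OK.)

The other equations can then be solved to give the same answers as before. $\square$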
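
One way to turn Theorem 8.11 into code, without relying on common sense, is to start from the indicator of $A$ and apply the equations repeatedly. After $t$ sweeps the vector holds the probability of hitting $A$ by time $t$, which increases towards the minimal non-negative solution. The sketch below is an illustration under assumptions: the function name, the fixed number of sweeps, and the 0-based state numbering are all choices made here, not part of the notes.

```python
def hitting_probabilities(P, target, sweeps=200):
    """Approximate h_iA by iterating the equations of Theorem 8.11.

    Starts from h = 1 on the target set and 0 elsewhere, so after t sweeps
    h[i] is the probability of hitting the target by time t from state i.
    The number of sweeps is an arbitrary cut-off chosen for this example.
    """
    n = len(P)
    h = [1.0 if i in target else 0.0 for i in range(n)]
    for _ in range(sweeps):
        h = [
            1.0 if i in target else sum(P[i][j] * h[j] for j in range(n))
            for i in range(n)
        ]
    return h

# Four-state chain from the example; index k stands for state k + 1.
P = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]

h = hitting_probabilities(P, target={3})  # target is state 4
print(h)
```

Because the iteration starts at zero outside $A$, the unspecified value $h_{14}$ stays at 0 automatically, which is the minimal choice made by hand in the example.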


Proof of Theorem 8.11 (non-examinable):

Consider the equations

$$h_{iA} = \begin{cases} 1 & \text{for } i \in A, \\ \sum_{j \in S} p_{ij} h_{jA} & \text{for } i \notin A. \end{cases} \qquad (\star)$$

We need to show that:
(i) the hitting probabilities $\{h_{iA}\}$ collectively satisfy the equations $(\star)$;
(ii) if $\{g_{iA}\}$ is any other non-negative solution to $(\star)$, then the hitting probabilities $\{h_{iA}\}$ satisfy $h_{iA} \le g_{iA}$ for all $i$ (minimal solution).

Proof of (i): Clearly, $h_{iA} = 1$ if $i \in A$ (as the chain hits $A$ immediately).

Suppose that $i \notin A$. Then

$$\begin{aligned}
h_{iA} &= P(X_t \in A \text{ for some } t \ge 1 \mid X_0 = i) \\
&= \sum_{j \in S} P(X_t \in A \text{ for some } t \ge 1 \mid X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{(Partition Rule)} \\
&= \sum_{j \in S} h_{jA}\, p_{ij} && \text{(by definitions).}
\end{aligned}$$

Thus the hitting probabilities $\{h_{iA}\}$ must satisfy the equations $(\star)$.
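Proof of (ii): Let $h^{(t)}_{iA} = P(\text{hit } A \text{ at or before time } t \mid X_0 = i)$.

We use mathematical induction to show that $h^{(t)}_{iA} \le g_{iA}$ for all $t$, and therefore $h_{iA} = \lim_{t \to \infty} h^{(t)}_{iA}$ must also be $\le g_{iA}$.

Time $t = 0$: $h^{(0)}_{iA} = 1$ if $i \in A$, and $h^{(0)}_{iA} = 0$ if $i \notin A$.

But because $g_{iA}$ is non-negative and satisfies $(\star)$, $g_{iA} = 1$ if $i \in A$, and $g_{iA} \ge 0$ for all $i$.

So $g_{iA} \ge h^{(0)}_{iA}$ for all $i$.

The inductive hypothesis is true for time $t = 0$.

Time $t$: Suppose the inductive hypothesis holds for time $t$, i.e. $h^{(t)}_{jA} \le g_{jA}$ for all $j$.

Consider a state $i \notin A$ (if $i \in A$, then $h^{(t+1)}_{iA} = 1 = g_{iA}$):

$$\begin{aligned}
h^{(t+1)}_{iA} &= P(\text{hit } A \text{ by time } t + 1 \mid X_0 = i) \\
&= \sum_{j \in S} P(\text{hit } A \text{ by time } t + 1 \mid X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{(Partition Rule)} \\
&= \sum_{j \in S} h^{(t)}_{jA}\, p_{ij} && \text{(by definitions)} \\
&\le \sum_{j \in S} g_{jA}\, p_{ij} && \text{(by inductive hypothesis)} \\
&= g_{iA} && \text{(because } \{g_{iA}\} \text{ satisfies } (\star)\text{).}
\end{aligned}$$

Thus $h^{(t+1)}_{iA} \le g_{iA}$ for all $i$, so the inductive hypothesis is proved.

By the Continuity Theorem (Chapter 2), $h_{iA} = \lim_{t \to \infty} h^{(t)}_{iA}$.

So $h_{iA} \le g_{iA}$ as required. $\square$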
8.12 Expected hitting times

In the previous section we found the probability of hitting set $A$, starting at state $i$. Now we study how long it takes to get from $i$ to $A$. As before, it is best to solve problems using first-step analysis and common sense. However, a general formula is also available.

Definition: Let $A$ be a subset of the state space $S$. The hitting time of $A$ is the random variable $T_A$, where

$$T_A = \min\{t \ge 0 : X_t \in A\}.$$

$T_A$ is the time taken before hitting set $A$ for the first time.

The hitting time $T_A$ can take values $0, 1, 2, \ldots,$ and $\infty$.
If the chain never hits set $A$, then $T_A = \infty$.

Note: The hitting time is also called the reaching time. If $A$ is a closed class, it is also called the absorption time.

Definition: The mean hitting time for $A$, starting from state $i$, is

$$m_{iA} = E(T_A \mid X_0 = i).$$

Note: If there is any possibility that the chain never reaches $A$, starting from $i$, i.e. if the hitting probability $h_{iA} < 1$, then $E(T_A \mid X_0 = i) = \infty$.


Calculating the mean hitting times

Theorem 8.12: The vector of expected hitting times $m_A = (m_{iA} : i \in S)$ is the minimal non-negative solution to the following equations:

$$m_{iA} = \begin{cases} 0 & \text{for } i \in A, \\ 1 + \sum_{j \notin A} p_{ij} m_{jA} & \text{for } i \notin A. \end{cases}$$

Proof (sketch):

Consider the equations

$$m_{iA} = \begin{cases} 0 & \text{for } i \in A, \\ 1 + \sum_{j \notin A} p_{ij} m_{jA} & \text{for } i \notin A. \end{cases} \qquad (\star)$$

We need to show that:
(i) the mean hitting times $\{m_{iA}\}$ collectively satisfy the equations $(\star)$;
(ii) if $\{u_{iA}\}$ is any other non-negative solution to $(\star)$, then the mean hitting times $\{m_{iA}\}$ satisfy $m_{iA} \le u_{iA}$ for all $i$ (minimal solution).

We will prove point (i) only. A proof of (ii) can be found online at:
http://www.statslab.cam.ac.uk/~james/Markov/ , Section 1.3.

Proof of (i): Clearly, $m_{iA} = 0$ if $i \in A$ (as the chain hits $A$ immediately).
Suppose that $i \notin A$. Then

$$\begin{aligned}
m_{iA} &= E(T_A \mid X_0 = i) \\
&= 1 + \sum_{j \in S} E(T_A \mid X_0 = j)\, P(X_1 = j \mid X_0 = i) \\
&\qquad \text{(conditional expectation: take 1 step to get to state } j \text{ at time 1, then find } E(T_A) \text{ from there)} \\
&= 1 + \sum_{j \in S} m_{jA}\, p_{ij} && \text{(by definitions)} \\
&= 1 + \sum_{j \notin A} p_{ij}\, m_{jA}, && \text{because } m_{jA} = 0 \text{ for } j \in A.
\end{aligned}$$

Thus the mean hitting times $\{m_{iA}\}$ must satisfy the equations $(\star)$.


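Example: Let $\{X_t : t \ge 0\}$ have the same transition diagram as before:

[Transition diagram: states 1, 2, 3, 4 in a line, with states 1 and 4 absorbing; states 2 and 3 each move one step left or right with probability 1/2.]

Starting from state 2, find the expected time to absorption.

Solution:

Starting from state $i = 2$, we wish to find the expected time to reach the set $A = \{1, 4\}$ (the set of absorbing states).

Thus we are looking for $m_{iA} = m_{2A}$.

$$\text{Now } m_{iA} = \begin{cases} 0 & \text{if } i \in \{1, 4\}, \\ 1 + \sum_{j \notin A} p_{ij} m_{jA} & \text{if } i \notin \{1, 4\}. \end{cases}$$

Thus,

$m_{1A} = 0$ (because $1 \in A$)

$m_{4A} = 0$ (because $4 \in A$)

$m_{2A} = 1 + \tfrac12 m_{1A} + \tfrac12 m_{3A} \;\Rightarrow\; m_{2A} = 1 + \tfrac12 m_{3A}$

$$\begin{aligned}
m_{3A} &= 1 + \tfrac12 m_{2A} + \tfrac12 m_{4A} \\
&= 1 + \tfrac12 m_{2A} \\
&= 1 + \tfrac12 \left( 1 + \tfrac12 m_{3A} \right) \\
\Rightarrow \tfrac34 m_{3A} &= \tfrac32 \\
\Rightarrow m_{3A} &= 2.
\end{aligned}$$

Thus,

$$m_{2A} = 1 + \tfrac12 m_{3A} = 2.$$

The expected time to absorption is therefore $E(T_A) = 2$ steps.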


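Example: Glee-flea hops around on a triangle. At each step he moves to one of the other two vertices at random. What is the expected time taken for Glee-flea to get from vertex 1 to vertex 2?

Solution:

The transition matrix is

$$P = \begin{pmatrix} 0 & \tfrac12 & \tfrac12 \\ \tfrac12 & 0 & \tfrac12 \\ \tfrac12 & \tfrac12 & 0 \end{pmatrix}.$$

We wish to find $m_{12}$.

$$\text{Now } m_{i2} = \begin{cases} 0 & \text{if } i = 2, \\ 1 + \sum_{j \ne 2} p_{ij} m_{j2} & \text{if } i \ne 2. \end{cases}$$

Thus

$m_{22} = 0$

$m_{12} = 1 + \tfrac12 m_{22} + \tfrac12 m_{32} = 1 + \tfrac12 m_{32}$

$$\begin{aligned}
m_{32} &= 1 + \tfrac12 m_{22} + \tfrac12 m_{12} \\
&= 1 + \tfrac12 m_{12} \\
&= 1 + \tfrac12 \left( 1 + \tfrac12 m_{32} \right) \\
\Rightarrow m_{32} &= 2.
\end{aligned}$$

Thus $m_{12} = 1 + \tfrac12 m_{32} = 2$ steps.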
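
The equations of Theorem 8.12 form a linear system in the unknowns $m_{iA}$ for $i \notin A$, so they can also be solved mechanically. The following Python sketch is an illustration, not the method of the notes: it assumes that $A$ is reached with probability 1 from every state (so the system has a unique finite solution), uses plain Gauss-Jordan elimination with exact fractions, and the function name `mean_hitting_times` is made up here.

```python
from fractions import Fraction as F

def mean_hitting_times(P, target):
    """Solve m_i = 1 + sum_{j not in target} p_ij m_j for i outside target.

    Assumes the target set is hit with probability 1 from every state,
    so that the linear system has a unique solution. States are 0-based.
    """
    n = len(P)
    free = [i for i in range(n) if i not in target]
    k = len(free)
    # Augmented system (I - Q) m = 1, with Q = P restricted to states outside target.
    rows = [
        [F(int(a == b)) - F(P[a][b]) for b in free] + [F(1)]
        for a in free
    ]
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [x / scale for x in rows[col]]
        for r in range(k):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    m = [F(0)] * n
    for idx, state in enumerate(free):
        m[state] = rows[idx][k]
    return m

half = F(1, 2)
# Four-state chain with absorbing ends (index k stands for state k + 1).
line = [
    [1, 0, 0, 0],
    [half, 0, half, 0],
    [0, half, 0, half],
    [0, 0, 0, 1],
]
# Glee-flea on a triangle.
triangle = [
    [0, half, half],
    [half, 0, half],
    [half, half, 0],
]

m_line = mean_hitting_times(line, target={0, 3})   # A = {1, 4}
m_triangle = mean_hitting_times(triangle, target={1})  # A = {2}
print(m_line, m_triangle)
```

If some state outside $A$ could fail to reach $A$, the mean hitting time from it is infinite and this linear solve is not appropriate.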
Chapter 9: Equilibrium

In Chapter 8, we saw that if $\{X_0, X_1, X_2, \ldots\}$ is a Markov chain with transition matrix $P$, then

$$X_t \sim \pi^T \;\Rightarrow\; X_{t+1} \sim \pi^T P.$$

This raises the question: is there any distribution $\pi$ such that $\pi^T P = \pi^T$?

If $\pi^T P = \pi^T$, then

$$\begin{aligned}
X_t \sim \pi^T \;&\Rightarrow\; X_{t+1} \sim \pi^T P = \pi^T \\
&\Rightarrow\; X_{t+2} \sim \pi^T P = \pi^T \\
&\Rightarrow\; X_{t+3} \sim \pi^T P = \pi^T \\
&\Rightarrow\; \ldots
\end{aligned}$$

In other words, if $\pi^T P = \pi^T$, and $X_t \sim \pi^T$, then

$$X_t \sim X_{t+1} \sim X_{t+2} \sim X_{t+3} \sim \ldots$$

Thus, once a Markov chain has reached a distribution $\pi^T$ such that $\pi^T P = \pi^T$, it will stay there.

If $\pi^T P = \pi^T$, we say that the distribution $\pi^T$ is an equilibrium distribution.

Equilibrium means a level position: there is no more change in the distribution of $X_t$ as we wander through the Markov chain.

Note: Equilibrium does not mean that the value of $X_{t+1}$ equals the value of $X_t$.
It means that the distribution of $X_{t+1}$ is the same as the distribution of $X_t$:

e.g. $P(X_{t+1} = 1) = P(X_t = 1) = \pi_1$;

$P(X_{t+1} = 2) = P(X_t = 2) = \pi_2$, etc.

In this chapter, we will first see how to calculate the equilibrium distribution $\pi^T$.

We will then see the remarkable result that many Markov chains automatically find their own way to an equilibrium distribution as the chain wanders through time. This happens for many Markov chains, but not all. We will see the conditions required for the chain to find its way to an equilibrium distribution.
9.1 Equilibrium distribution in pictures

Consider the following 4-state Markov chain:

$$P = \begin{pmatrix} 0.0 & 0.9 & 0.1 & 0.0 \\ 0.8 & 0.1 & 0.0 & 0.1 \\ 0.0 & 0.5 & 0.3 & 0.2 \\ 0.1 & 0.0 & 0.0 & 0.9 \end{pmatrix}$$

Suppose we start at time 0 with $X_0 \sim \left( \tfrac14, \tfrac14, \tfrac14, \tfrac14 \right)$: so the chain is equally likely to start from any of the four states. Here are pictures of the distributions of $X_0, X_1, \ldots, X_4$:

[Figure: five bar charts of $P(X_t = x)$ against $x = 1, 2, 3, 4$, for $t = 0, 1, 2, 3, 4$.]

The distribution starts off level, but quickly changes: for example the chain is least likely to be found in state 3. The distribution of $X_t$ changes between each $t = 0, 1, 2, 3, 4$. Now look at the distribution of $X_t$ 500 steps into the future:

[Figure: five bar charts of $P(X_t = x)$ against $x = 1, 2, 3, 4$, for $t = 500, 501, 502, 503, 504$; the vertical axis runs from 0.0 to 0.4.]

The distribution has reached a steady state: it does not change between $t = 500, 501, \ldots, 504$. The chain has reached equilibrium of its own accord.


9.2 Calculating equilibrium distributions

Definition: Let $\{X_0, X_1, \ldots\}$ be a Markov chain with transition matrix $P$ and state space $S$, where $|S| = N$ (possibly infinite). Let $\pi^T$ be a row vector denoting a probability distribution on $S$: so each element $\pi_i$ denotes the probability of being in state $i$, and $\sum_{i=1}^{N} \pi_i = 1$, where $\pi_i \ge 0$ for all $i = 1, \ldots, N$. The probability distribution $\pi^T$ is an equilibrium distribution for the Markov chain if $\pi^T P = \pi^T$.

That is, $\pi^T$ is an equilibrium distribution if

$$(\pi^T P)_j = \sum_{i=1}^{N} \pi_i p_{ij} = \pi_j \quad \text{for all } j = 1, \ldots, N.$$

By the argument given at the start of this chapter, we have the following Theorem:

Theorem 9.2: Let $\{X_0, X_1, \ldots\}$ be a Markov chain with transition matrix $P$. Suppose that $\pi^T$ is an equilibrium distribution for the chain. If $X_t \sim \pi^T$ for any $t$, then $X_{t+r} \sim \pi^T$ for all $r \ge 0$. $\square$

Once a chain has hit an equilibrium distribution, it stays there for ever.

Note: There are several other names for an equilibrium distribution. If $\pi^T$ is an equilibrium distribution, it is also called:

• invariant: it doesn't change: $\pi^T P = \pi^T$;

• stationary: the chain 'stops' here.

Stationarity: the Chain Station

a train station is where a train stops

a workstation is where... ???

a stationary distribution is where a Markov chain stops
9.3 Finding an equilibrium distribution

Vector $\pi^T$ is an equilibrium distribution for $P$ if:

1. $\pi^T P = \pi^T$;
2. $\sum_{i=1}^{N} \pi_i = 1$;
3. $\pi_i \ge 0$ for all $i$.

Conditions 2 and 3 ensure that $\pi^T$ is a genuine probability distribution.
Condition 1 means that $\pi$ is a row eigenvector of $P$.

Solving $\pi^T P = \pi^T$ by itself will just specify $\pi$ up to a scalar multiple.
We need to include Condition 2 to scale $\pi$ to a genuine probability distribution, and then check with Condition 3 that the scaled distribution is valid.


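Example: Find an equilibrium distribution for the Markov chain below.

$$P = \begin{pmatrix} 0.0 & 0.9 & 0.1 & 0.0 \\ 0.8 & 0.1 & 0.0 & 0.1 \\ 0.0 & 0.5 & 0.3 & 0.2 \\ 0.1 & 0.0 & 0.0 & 0.9 \end{pmatrix}$$

Solution:

Let $\pi^T = (\pi_1, \pi_2, \pi_3, \pi_4)$.
The equations are $\pi^T P = \pi^T$ and $\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1$.

$$\pi^T P = \pi^T \;\Rightarrow\; \begin{pmatrix} \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{pmatrix} \begin{pmatrix} 0.0 & 0.9 & 0.1 & 0.0 \\ 0.8 & 0.1 & 0.0 & 0.1 \\ 0.0 & 0.5 & 0.3 & 0.2 \\ 0.1 & 0.0 & 0.0 & 0.9 \end{pmatrix} = \begin{pmatrix} \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{pmatrix}$$

$$\begin{aligned}
0.8 \pi_2 + 0.1 \pi_4 &= \pi_1 && (1) \\
0.9 \pi_1 + 0.1 \pi_2 + 0.5 \pi_3 &= \pi_2 && (2) \\
0.1 \pi_1 + 0.3 \pi_3 &= \pi_3 && (3) \\
0.1 \pi_2 + 0.2 \pi_3 + 0.9 \pi_4 &= \pi_4 && (4)
\end{aligned}$$

Also $\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1$. (5)

(3) $\Rightarrow \pi_1 = 7 \pi_3$

Substitute in (2) $\Rightarrow 0.9 (7 \pi_3) + 0.5 \pi_3 = 0.9 \pi_2 \;\Rightarrow\; \pi_2 = \tfrac{68}{9} \pi_3$

Substitute in (1) $\Rightarrow 0.8 \left( \tfrac{68}{9} \pi_3 \right) + 0.1 \pi_4 = 7 \pi_3 \;\Rightarrow\; \pi_4 = \tfrac{86}{9} \pi_3$

Substitute all in (5) $\Rightarrow \pi_3 \left( 7 + \tfrac{68}{9} + 1 + \tfrac{86}{9} \right) = 1 \;\Rightarrow\; \pi_3 = \tfrac{9}{226}$

Overall:

$$\pi^T = \left( \tfrac{63}{226}, \tfrac{68}{226}, \tfrac{9}{226}, \tfrac{86}{226} \right) = (0.28, 0.30, 0.04, 0.38).$$

This is the distribution the chain converged to in Section 9.1.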
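
The same calculation can be carried out by solving the linear equations directly. The sketch below is one possible approach, not the method of the notes: it assumes that the chain has exactly one equilibrium distribution, replaces one of the (redundant) equations of $\pi^T P = \pi^T$ by the condition that the probabilities sum to 1, and solves by Gauss-Jordan elimination with exact fractions. The function name `equilibrium` is chosen here for illustration.

```python
from fractions import Fraction as F

def equilibrium(P):
    """Solve pi^T P = pi^T together with sum(pi) = 1, using exact fractions.

    Assumes the chain has exactly one equilibrium distribution, so that the
    linear system built below has a unique solution.
    """
    n = len(P)
    # Column j of P gives the equation: sum_i pi_i p_ij - pi_j = 0.
    # One of these n equations is redundant; the last is replaced by sum(pi) = 1.
    rows = [
        [F(P[i][j]) - int(i == j) for i in range(n)] + [F(0)]
        for j in range(n - 1)
    ]
    rows.append([F(1)] * n + [F(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [x / scale for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

P = [[F(x) for x in line.split()] for line in (
    "0.0 0.9 0.1 0.0",
    "0.8 0.1 0.0 0.1",
    "0.0 0.5 0.3 0.2",
    "0.1 0.0 0.0 0.9",
)]

pi = equilibrium(P)
print(pi)
print([round(float(x), 2) for x in pi])
```

Condition 3 (non-negativity) is not enforced by the solver and should still be checked on the result, as described above.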
9.4 Long-term behaviour

In Section 9.1, we saw an example where the Markov chain wandered of its own accord into its equilibrium distribution:

[Figure: bar charts of $P(X_t = x)$ against $x = 1, 2, 3, 4$ for $t = 500, 501, 502, 503, \ldots$, all showing the same distribution.]

This will always happen for this Markov chain. In fact, the distribution it converges to (found above) does not depend upon the starting conditions: for ANY value of $X_0$, we will always have $X_t \sim (0.28, 0.30, 0.04, 0.38)$ as $t \to \infty$.

What is happening here is that each row of the transition matrix $P^t$ converges to the equilibrium distribution $(0.28, 0.30, 0.04, 0.38)$ as $t \to \infty$:

$$P = \begin{pmatrix} 0.0 & 0.9 & 0.1 & 0.0 \\ 0.8 & 0.1 & 0.0 & 0.1 \\ 0.0 & 0.5 & 0.3 & 0.2 \\ 0.1 & 0.0 & 0.0 & 0.9 \end{pmatrix} \;\Rightarrow\; P^t \to \begin{pmatrix} 0.28 & 0.30 & 0.04 & 0.38 \\ 0.28 & 0.30 & 0.04 & 0.38 \\ 0.28 & 0.30 & 0.04 & 0.38 \\ 0.28 & 0.30 & 0.04 & 0.38 \end{pmatrix} \text{ as } t \to \infty.$$

(If you have a calculator that can handle matrices, try finding $P^t$ for $t = 20$ and $t = 30$: you will find the matrix is already converging as above.)

This convergence of $P^t$ means that for large $t$, no matter WHICH state we start in, we always have probability

• about 0.28 of being in State 1 after $t$ steps;

• about 0.30 of being in State 2 after $t$ steps;

• about 0.04 of being in State 3 after $t$ steps;

• about 0.38 of being in State 4 after $t$ steps.
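
In place of a matrix calculator, the powers $P^t$ can be computed with a short program. The sketch below is an illustration in Python using ordinary floating-point arithmetic; the helper names `mat_mul` and `mat_pow`, and the particular values of $t$ printed, are choices made here. It also includes the two-state chain that swaps states with probability 1, for which the powers do not settle down.

```python
def mat_mul(A, B):
    """Matrix product: (AB)_ij = sum over k of a_ik * b_kj."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    """P^t by repeated multiplication, starting from the identity matrix."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

P = [
    [0.0, 0.9, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.1],
    [0.0, 0.5, 0.3, 0.2],
    [0.1, 0.0, 0.0, 0.9],
]

for t in (20, 30, 200):
    print("t =", t)
    for row in mat_pow(P, t):
        print([round(x, 3) for x in row])

# A chain that swaps between its two states with probability 1 at every step.
swap = [[0.0, 1.0], [1.0, 0.0]]
print(mat_pow(swap, 500), mat_pow(swap, 501))
```

For the 4-state chain the rows become indistinguishable as $t$ grows, whereas the powers of the swapping chain alternate between two different matrices.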


[Figure: two line plots against time $t$ from 0 to 100, titled "Start at $X_0 = 2$" and "Start at $X_0 = 4$", each with one curve for each end state (State 1 to State 4).]

The left graph shows the probability of getting from state 2 to state $k$ in $t$ steps, as $t$ changes: $(P^t)_{2k}$ for $k = 1, 2, 3, 4$.

The right graph shows the probability of getting from state 4 to state $k$ in $t$ steps, as $t$ changes: $(P^t)_{4k}$ for $k = 1, 2, 3, 4$.

The initial behaviour differs greatly for the different start states.
The long-term behaviour (large $t$) is the same for both start states.

However, this does not always happen. Consider the two-state chain below, in which each state moves to the other with probability 1:

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

As $t$ gets large, $P^t$ does not converge:

$$P^{500} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad P^{501} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad P^{502} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad P^{503} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \ldots$$

For this Markov chain, we never 'forget' the initial start state.
General formula for $P^t$

We have seen that we are interested in whether $P^t$ converges to a fixed matrix with all rows equal as $t \to \infty$.

If it does, then the Markov chain will reach an equilibrium distribution that does not depend upon the starting conditions.

The equilibrium distribution is then given by any row of the converged $P^t$.

It can be shown that a general formula is available for $P^t$ for any $t$, based on the eigenvalues of $P$. Producing this formula is beyond the scope of this course, but if you are given the formula, you should be able to recognise whether $P^t$ is going to converge to a fixed matrix with all rows the same.


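Example 1:

[Transition diagram not reproduced: a two-state chain with the transition matrix below.]

$$P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}$$

We can show that the general solution for $P^t$ is:

$$P^t = \frac{1}{7} \left\{ \begin{pmatrix} 3 & 4 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 4 & -4 \\ -3 & 3 \end{pmatrix} (-0.4)^t \right\}.$$

As $t \to \infty$, $(-0.4)^t \to 0$, so

$$P^t \to \frac{1}{7} \begin{pmatrix} 3 & 4 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} \tfrac37 & \tfrac47 \\ \tfrac37 & \tfrac47 \end{pmatrix}.$$

This Markov chain will therefore converge to the equilibrium distribution $\pi^T = \left( \tfrac37, \tfrac47 \right)$ as $t \to \infty$, regardless of whether the flea starts in state 1 or state 2.

Exercise: Verify that $\pi^T = \left( \tfrac37, \tfrac47 \right)$ is the same as the result you obtain from solving the equilibrium equations: $\pi^T P = \pi^T$ and $\pi_1 + \pi_2 = 1$.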
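
A closed-form expression for $P^t$ of this kind can be sanity-checked against repeated multiplication. The sketch below assumes the two-state matrix with rows $(0.2, 0.8)$ and $(0.6, 0.4)$ and the closed form $\frac17 \{ (3\ 4;\ 3\ 4) + (4\ {-4};\ {-3}\ 3)(-0.4)^t \}$; the function names are illustrative, and exact fractions are used so that the comparison is an exact equality.

```python
from fractions import Fraction as F

P = [[F(1, 5), F(4, 5)], [F(3, 5), F(2, 5)]]

def closed_form(t):
    """Closed-form P^t: (1/7) * { [[3, 4], [3, 4]] + [[4, -4], [-3, 3]] * (-0.4)^t }."""
    r = F(-2, 5) ** t
    return [
        [(3 + 4 * r) / 7, (4 - 4 * r) / 7],
        [(3 - 3 * r) / 7, (4 + 3 * r) / 7],
    ]

def power(matrix, t):
    """matrix^t for a 2 x 2 matrix, by repeated multiplication."""
    result = [[F(1), F(0)], [F(0), F(1)]]
    for _ in range(t):
        result = [
            [sum(result[i][k] * matrix[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)
        ]
    return result

agree = all(closed_form(t) == power(P, t) for t in range(10))
print(agree)

# Equilibrium check for the exercise: pi^T P should equal pi^T.
pi = [F(3, 7), F(4, 7)]
pi_P = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
print(pi_P == pi)
```

Checking a handful of values of $t$ does not prove the formula, but it catches sign or transcription slips quickly.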
Example 2: Purpose-flea knows exactly what he is doing, so his probabilities are all 1:

[Transition diagram: two states, each moving to the other with probability 1.]

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

We can show that the general solution for $P^t$ is:

$$P^t = \frac{1}{2} \left\{ \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} (-1)^t \right\}.$$

As $t \to \infty$, $(-1)^t$ does not converge to 0, so

$$P^t = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \text{ if } t \text{ is even}, \qquad P^t = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \text{ if } t \text{ is odd},$$

for all $t$.

In this example, $P^t$ never converges to a matrix with both rows identical as $t$ gets large. The chain never 'forgets' its starting conditions as $t \to \infty$.

Exercise: Verify that this Markov chain does have an equilibrium distribution, $\pi^T = \left( \tfrac12, \tfrac12 \right)$. However, the chain does not converge to this distribution as $t \to \infty$.

These examples show that some Markov chains forget their starting conditions in the long term, and ensure that $X_t$ will have the same distribution as $t \to \infty$ regardless of where we started at $X_0$. However, for other Markov chains, the initial conditions are never forgotten. In the next sections we look for general criteria that will ensure the chain converges.
Target Result:

• If a Markov chain is irreducible and aperiodic, and if an equilibrium distribution $\pi^T$ exists, then the chain converges to this distribution as $t \to \infty$, regardless of the initial starting states.

To make sense of this, we need to revise the concept of irreducibility, and introduce the idea of aperiodicity.


9.5 Irreducibility 


Recall from Chapter 8:

Definition: A Markov chain or transition matrix $P$ is said to be irreducible if $i \leftrightarrow j$ for all $i, j \in S$. That is, the chain is irreducible if the state space $S$ is a single communicating class.

An irreducible Markov chain consists of a single class.

[Figure: two transition diagrams, one labelled "Irreducible" and one labelled "Not irreducible".]

Irreducibility of a Markov chain is important for convergence to equilibrium as $t \to \infty$, because we want the convergence to be independent of start state.

This can happen if the chain is irreducible. When the chain is not irreducible, different start states might cause the chain to get stuck in different closed classes. In the example above, a start state of $X_0 = 1$ means that the chain is restricted to states 1 and 2 as $t \to \infty$, whereas a start state of $X_0 = 4$ means that the chain is restricted to states 4 and 5 as $t \to \infty$. A single convergence that 'forgets' the initial state is therefore not possible.




9.6 Periodicity 


1 0 


1 
_ ' R (Hot), Suppose that Xo = 1. 
1 


Then xX; = 1 for all even values of t, and X; = 2 for all odd values of t. 


Consider the Markov chain with transition matrix P = ( ae. ) ; 


This sort of behaviour is called periodicity: the Markov chain can only return 
to a state at particular values of t. 


Clearly, periodicity of the chain will interfere with convergence to an equilibrium 
distribution as t + oo. For example, 


1 for even values of t, 
POS =1|45=1)= 
0 for odd values of t. 


Therefore, the probability can not converge to any single value as t — oo. 


Period of state 2 


To formalize the notion of periodicity, we define the period of a state 2. 
Intuitively, the period 1s defined so that the time taken to get from state 1 back to 
state 1 again is always a multiple of the period. 


In the example above, the chain can return to state 1 after 2 steps, 4 steps, 6
steps, 8 steps, ...

The period of state 1 is therefore 2.

In general, the chain can return from state $i$ back to state $i$ again in $t$ steps if
$(P^t)_{ii} > 0$. This prompts the following definition.

Definition: The period $d(i)$ of a state $i$ is

$$d(i) = \gcd\{t : (P^t)_{ii} > 0\},$$

the greatest common divisor of the times at which return is possible.


Definition: The state i is said to be periodic if d(i) > 1. 


For a periodic state $i$, $(P^t)_{ii} = 0$ if $t$ is not a multiple of $d(i)$.


Definition: The state $i$ is said to be aperiodic if $d(i) = 1$.


If state $i$ is aperiodic, it means that return to state $i$ is not limited only to regularly
repeating times.


For convergence to equilibrium as $t \to \infty$, we will be interested only in aperiodic
states.
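
The definition can be turned into a small computation. The sketch below is only an approximation of the definition: it collects the possible return times up to a finite `horizon` (an assumed cut-off, not part of the definition) and takes their greatest common divisor. Both example matrices are hypothetical.

```python
from math import gcd


def return_times(P, i, horizon):
    """Times t in 1..horizon at which (P^t)_{ii} > 0.

    Only the pattern of positive entries matters, so we track which
    states can be occupied after exactly t steps when starting from i.
    """
    n = len(P)
    support = [j == i for j in range(n)]
    times = []
    for t in range(1, horizon + 1):
        support = [
            any(support[k] and P[k][j] > 0 for k in range(n)) for j in range(n)
        ]
        if support[i]:
            times.append(t)
    return times


def period(P, i, horizon=50):
    """gcd of the return times found up to `horizon` (0 if there are none)."""
    d = 0
    for t in return_times(P, i, horizon):
        d = gcd(d, t)
    return d


# Hypothetical periodic chain: returns only at even times.
P_flip = [[0, 1], [1, 0]]

# Hypothetical chain where state 0 can return in 2 steps or in 3 steps.
P_mixed = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]

print(return_times(P_flip, 0, 6))   # [2, 4, 6]
print(period(P_flip, 0))            # 2
print(return_times(P_mixed, 0, 5))  # [2, 3, 4, 5]
print(period(P_mixed, 0))           # 1
```

Note that an aperiodic state need not be able to return in one step: in `P_mixed`, return in 2 or 3 steps already forces the gcd to be 1.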


The following examples show how to calculate the period for both aperiodic 
and periodic states. 


Examples: Find the periods of the given states in the following Markov chains, 
and state whether or not the chain is irreducible. 


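1. The simple random walk.

[Transition diagram of the simple random walk, with arrows labelled $p$ and $1-p$.]

$d(0) = \gcd\{2, 4, 6, \ldots\} = 2$.

Chain is irreducible.


2. [Transition diagram not recoverable.]

$d(1) = \gcd\{2, 3, 4, \ldots\} = 1$.

Chain is irreducible.


3. [Transition diagram not recoverable.]

$d(1) = \gcd\{2, 4, 6, \ldots\} = 2$.

Chain is irreducible.


4. [Transition diagram not recoverable.]

$d(1) = \gcd\{2, 4, 6, \ldots\} = 2$.

Chain is NOT irreducible (i.e. Reducible).


5. [Transition diagram not recoverable.]

$d(1) = \gcd\{2, 4, 5, 6, \ldots\} = 1$.

Chain is irreducible.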
9.7 Convergence to Equilibrium


We now draw together the threads of the previous sections with the following
results.


Fact: If $i \leftrightarrow j$, then $i$ and $j$ have the same period. (Proof omitted.)


This leads immediately to the following result:


If a Markov chain is irreducible and has one aperiodic state,
then all states are aperiodic.


We can therefore talk about an irreducible, aperiodic chain, meaning that 
all states are aperiodic. 


Theorem 9.7: Let $\{X_0, X_1, \ldots\}$ be an irreducible and aperiodic Markov chain
with transition matrix $P$. Suppose that there exists an equilibrium distribution
$\pi^T$. Then, from any starting state $i$, and for any end state $j$,

$$P(X_t = j \mid X_0 = i) \to \pi_j \quad \text{as } t \to \infty.$$

In particular,

$$(P^t)_{ij} \to \pi_j \quad \text{as } t \to \infty, \text{ for all } i \text{ and } j,$$

so $P^t$ converges to a matrix with all rows identical and equal to $\pi^T$. $\square$


For an irreducible, aperiodic Markov chain, with finite or infinite state space,
the existence of an equilibrium distribution $\pi^T$ ensures
that the Markov chain will converge to $\pi^T$ as $t \to \infty$.


Note: If the state space is infinite, it is not guaranteed that an equilibrium
distribution $\pi^T$ exists. See Example 3 below.


Note: If the chain converges to an equilibrium distribution $\pi^T$ as $t \to \infty$, then the
long-run proportion of time spent in state $k$ is $\pi_k$.
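
To illustrate Theorem 9.7, the sketch below starts a chain in each possible state and computes its distribution after 50 steps. The 3-state matrix is hypothetical, chosen to be irreducible and aperiodic with equilibrium distribution (0.25, 0.5, 0.25); the function names are illustrative.

```python
def step(dist, P):
    """One step of the chain: row vector `dist` multiplied by matrix `P`."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_after(P, start, t):
    """Distribution of X_t given X_0 = start, i.e. row `start` of P^t."""
    dist = [1.0 if j == start else 0.0 for j in range(len(P))]
    for _ in range(t):
        dist = step(dist, P)
    return dist


# Hypothetical irreducible, aperiodic chain.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

# Candidate equilibrium distribution: check that pi P = pi.
pi = [0.25, 0.5, 0.25]
print(step(pi, P))  # [0.25, 0.5, 0.25]

# Rows of P^50: every start state gives approximately the same distribution.
rows = [distribution_after(P, start, 50) for start in range(3)]
for row in rows:
    print([round(x, 6) for x in row])
```

Each printed row is approximately (0.25, 0.5, 0.25), so the chain has forgotten its start state.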


9.8 Examples 


A typical exam question gives you a Markov chain on a finite state space and
asks if it converges to an equilibrium distribution as $t \to \infty$. An equilibrium
distribution will always exist for a finite state space. You need to check whether
the chain is irreducible and aperiodic. If so, it will converge to equilibrium.
If the chain is irreducible but periodic, it cannot converge to an equilibrium
distribution that is independent of start state. If the chain is reducible, it may
or may not converge.


The first two examples are the same as the ones given in Section 9.4. 


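Example 1: State whether the Markov chain below converges to an equilibrium
distribution as $t \to \infty$.

[Transition diagram: states 1 and 2; 1 → 1 with probability 0.2, 1 → 2 with probability 0.8, 2 → 1 with probability 0.6, 2 → 2 with probability 0.4.]

$$P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}$$

The chain is irreducible and aperiodic, and an equilibrium distribution will exist
for a finite state space. So the chain does converge.

(From Section 9.4, the chain converges to $\pi^T = \left(\tfrac{3}{7}, \tfrac{4}{7}\right)$ as $t \to \infty$.)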
Example 2: State whether the Markov chain below converges to an equilibrium
distribution as $t \to \infty$.

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

The chain is irreducible, but it is NOT aperiodic: period = 2.

Thus the chain does NOT converge to an equilibrium distribution as $t \to \infty$.

It is important to check for aperiodicity, because the existence of an equilibrium
distribution does NOT ensure convergence to this distribution if the chain is
not aperiodic.


Example 3: Random walk with retaining barrier at 0.

[Transition diagram: states 0, 1, 2, 3, 4, ...; each arrow to the right has probability $p$ and each arrow to the left has probability $q$; state 0 returns to itself with probability $q$.]

Find whether the chain converges to equilibrium as $t \to \infty$, and if so, find the
equilibrium distribution.

The chain is irreducible and aperiodic, so if an equilibrium distribution exists,
then the chain will converge to this distribution as $t \to \infty$.

However, the chain has an infinite state space, so we cannot guarantee that an
equilibrium distribution exists.


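Try to solve the equilibrium equations: $\pi^T P = \pi^T$ and $\sum_i \pi_i = 1$, where

$$P = \begin{pmatrix} q & p & 0 & 0 & \cdots \\ q & 0 & p & 0 & \cdots \\ 0 & q & 0 & p & \cdots \\ \vdots & & \ddots & & \ddots \end{pmatrix}.$$

The equations are:

$$\begin{aligned} q\pi_0 + q\pi_1 &= \pi_0 \qquad (\star) \\ p\pi_0 + q\pi_2 &= \pi_1 \\ p\pi_1 + q\pi_3 &= \pi_2 \\ &\;\;\vdots \\ p\pi_{k-1} + q\pi_{k+1} &= \pi_k \quad \text{for } k = 1, 2, \ldots \end{aligned}$$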


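From $(\star)$, we have $p\pi_0 = q\pi_1$, so

$$\pi_1 = \frac{p}{q}\,\pi_0.$$

Substituting into the second equation gives $\pi_2 = \frac{1}{q}(\pi_1 - p\pi_0) = \frac{p}{q}\cdot\frac{1-q}{q}\,\pi_0 = \left(\frac{p}{q}\right)^2 \pi_0$, using $p + q = 1$.

We suspect that $\pi_k = \left(\frac{p}{q}\right)^k \pi_0$. Prove by induction.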


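The hypothesis is true for $k = 0, 1, 2$. Suppose that $\pi_{k-1} = \left(\frac{p}{q}\right)^{k-1} \pi_0$ and $\pi_k = \left(\frac{p}{q}\right)^k \pi_0$. Then

$$\begin{aligned} \pi_{k+1} &= \frac{1}{q}\left(\pi_k - p\,\pi_{k-1}\right) \\ &= \frac{1}{q}\left\{\left(\frac{p}{q}\right)^{k} - p\left(\frac{p}{q}\right)^{k-1}\right\}\pi_0 \\ &= \frac{p^k}{q^k}\left(\frac{1}{q} - 1\right)\pi_0 \\ &= \left(\frac{p}{q}\right)^{k+1}\pi_0. \end{aligned}$$

The inductive step holds, so $\pi_k = \left(\frac{p}{q}\right)^k \pi_0$ for all $k \ge 0$.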
We now need $\sum_{i=0}^{\infty} \pi_i = 1$, i.e. $\pi_0 \sum_{k=0}^{\infty} \left(\frac{p}{q}\right)^k = 1$.

The sum is a Geometric series, and converges only for $\left|\frac{p}{q}\right| < 1$. Thus when $p < q$, we have

$$\pi_0 \left(\frac{1}{1 - \frac{p}{q}}\right) = 1 \quad\Longrightarrow\quad \pi_0 = 1 - \frac{p}{q}.$$

If $p \ge q$, the series diverges and there is no equilibrium distribution.

Solution:


If $p < q$, the chain converges to an equilibrium distribution $\pi$, where
$\pi_k = \left(1 - \frac{p}{q}\right)\left(\frac{p}{q}\right)^k$ for $k = 0, 1, \ldots$.


If $p \ge q$, the chain does not converge to an equilibrium distribution as $t \to \infty$.
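
The solution can be checked with exact arithmetic. The sketch below picks illustrative values p = 3/10 and q = 7/10 (an assumption; any p < q with p + q = 1 would do) and confirms that the geometric form satisfies the boundary equation, the general equilibrium equation for the first few k, and that the partial sums approach 1.

```python
from fractions import Fraction


def pi(k, p, q):
    """Candidate equilibrium probability pi_k = (1 - p/q) (p/q)^k."""
    r = p / q
    return (1 - r) * r ** k


# Illustrative values with p < q and p + q = 1.
p, q = Fraction(3, 10), Fraction(7, 10)

# Boundary equation: q*pi_0 + q*pi_1 = pi_0
boundary_ok = q * pi(0, p, q) + q * pi(1, p, q) == pi(0, p, q)

# General equation: p*pi_{k-1} + q*pi_{k+1} = pi_k for k = 1, ..., 29
interior_ok = all(
    p * pi(k - 1, p, q) + q * pi(k + 1, p, q) == pi(k, p, q)
    for k in range(1, 30)
)

# Partial sum of the first 30 probabilities: equals 1 - (p/q)^30 exactly.
partial_sum = sum(pi(k, p, q) for k in range(30))

print(boundary_ok, interior_ok)  # True True
print(float(1 - partial_sum))    # tiny remainder, shrinking geometrically
```

Because `Fraction` arithmetic is exact, the equalities hold exactly here, not merely up to rounding.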


Example 4: Sketch of Exam Question 2006.
Consider a Markov chain with transition diagram:

[Transition diagram on states 1, 2, 3, 4.]

(a) Identify all communicating classes. For each class, state whether or not
it is closed.

Classes are:
{1}, {2}, {3} (each not closed);
{4} (closed).

(b) State whether the Markov chain is irreducible, and whether or not all states are
aperiodic.

Not irreducible: there are 4 classes.
All states are aperiodic.

(c) The equilibrium distribution is $\pi^T = (0, 0, 0, 1)$. Does the Markov chain
converge to this distribution as $t \to \infty$, regardless of its start state?

Yes, it clearly will converge to $\pi^T = (0, 0, 0, 1)$, despite failure of irreducibility.
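
A reducible chain of this kind can be explored numerically. The exam question's transition probabilities are not reproduced in these notes, so the matrix below is purely hypothetical: it was chosen only to have the same class structure ({1}, {2}, {3} not closed, {4} closed, all states aperiodic).

```python
def distribution_after(P, start, t):
    """Distribution of X_t given X_0 = start."""
    n = len(P)
    dist = [1.0 if j == start else 0.0 for j in range(n)]
    for _ in range(t):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical chain: each of states 1-3 either stays put or moves one
# state to the right; state 4 (index 3) is absorbing.
P_hypothetical = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]

limits = [distribution_after(P_hypothetical, start, 200) for start in range(4)]
for row in limits:
    print([round(x, 6) for x in row])  # each row is approximately (0, 0, 0, 1)
```

Every start state ends up in the single closed class, which is why convergence survives the failure of irreducibility in this case.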
Note: Equilibrium results also exist for chains that are not aperiodic. Also, states
can be classified as transient (return to the state is not certain), null recurrent
(return to the state is certain, but the expected return time is infinite), and
positive recurrent (return to the state is certain, and the expected return
time is finite). For each type of state, the long-term behaviour is known:


• If the state $k$ is transient or null-recurrent,
$P(X_t = k \mid X_0 = k) = (P^t)_{kk} \to 0$ as $t \to \infty$.


• If the state $k$ is positive recurrent and aperiodic, then
$P(X_t = k \mid X_0 = k) = (P^t)_{kk} \to \pi_k$ as $t \to \infty$, where $\pi_k > 0$.


The expected return time for the state is $1/\pi_k$.
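
The relationship between expected return time and equilibrium probability can be checked on a small example. The sketch below uses a hypothetical 3-state chain with equilibrium distribution (0.25, 0.5, 0.25) and computes expected hitting times by fixed-point iteration, which is one possible numerical approach, not the method of these notes.

```python
def mean_return_time(P, k, iterations=500):
    """Approximate expected return time to state k.

    m[j] approximates the expected number of steps to reach k from j != k,
    using m[j] = 1 + sum over l != k of P[j][l] * m[l].
    """
    n = len(P)
    m = [0.0] * n
    for _ in range(iterations):
        m = [
            0.0 if j == k
            else 1.0 + sum(P[j][l] * m[l] for l in range(n) if l != k)
            for j in range(n)
        ]
    # First step out of k, then the hitting time from wherever we land.
    return 1.0 + sum(P[k][l] * m[l] for l in range(n) if l != k)


# Hypothetical positive recurrent chain with pi = (0.25, 0.5, 0.25).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = [0.25, 0.5, 0.25]

for k in range(3):
    # The two printed values agree: approximately 4, 2 and 4.
    print(k, round(mean_return_time(P, k), 6), 1 / pi[k])
```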


A detailed treatment is available at 
http://www.statslab.cam.ac.uk/~james/Markov/. 


Special Process: the Two-Armed Bandit 


A well-known problem in probability is called the two-armed
bandit problem. The name is a reference to a type of gambling
machine called the two-armed bandit. The two arms of the
two-armed bandit offer different rewards, and the gambler
has to decide which arm to play without knowing which
is the better arm.

[Figure: one-armed bandit.]

A similar problem arises when doctors are experimenting with
two different treatments, without knowing which one is better.
Call the treatments A and B. One of them is likely to be better, but we don’t
know which one. A series of patients will each be given one of the treatments.
We aim to find a strategy that ensures that as many as possible of the patients
are given the better treatment, though we don’t know which one this is.


Suppose that, for any patient, treatment A has $P(\text{success}) = \alpha$, and treatment
B has $P(\text{success}) = \beta$, and all patients are independent. Assume that $0 < \alpha < 1$
and $0 < \beta < 1$.
First let’s look at a simple strategy the doctors might use:


• The random strategy for allocating patients to treatments A and B is
to choose from the two treatments at random, each with probability 0.5,
for each patient.


• Let $p_R$ be the overall probability of success for each patient with the
random strategy. Show that $p_R = \frac{1}{2}(\alpha + \beta)$.


The two-armed bandit strategy is more clever. For the first patient, we
choose treatment A or B at random (probability 0.5 each). If patient $n$ is given
treatment A and it is successful, then we use treatment A again for patient $n+1$,
for all $n = 1, 2, 3, \ldots$. If A is a failure for patient $n$, we switch to treatment B
for patient $n+1$. A similar rule is applied if patient $n$ is given treatment B: if
it is successful, we keep B for patient $n+1$; if it fails, we switch to A for patient
$n+1$.


Define the two-armed bandit process to be a Markov chain with state space
{(A, S), (A, F), (B, S), (B, F)}, where (A, S) means that patient $n$ is given
treatment A and it is successful, and so on.


Transition diagram:


Exercise: Draw on the missing arrows and find their probabilities in terms of
$\alpha$ and $\beta$.


(A,S)          (B,F)


(A,F)          (B,S)


Transition matrix:

        AS    AF    BS    BF
AS
AF
BS
BF
Probability of success under the two-armed bandit strategy


Define $p_T$ to be the long-run probability of success using the two-armed bandit
strategy.


Exercise: Find the equilibrium distribution $\pi$ for the two-armed bandit process.
Hence show that the long-run probability of success for each patient under
this strategy is:

$$p_T = \frac{\alpha + \beta - 2\alpha\beta}{2 - \alpha - \beta}.$$
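
Without working the exercise, the closed form can be sanity-checked by simulating the strategy directly. The sketch below assumes illustrative values α = 0.7 and β = 0.3, a fixed random seed, and 200,000 patients; the function names are hypothetical. It implements only the rule described above (stay after a success, switch after a failure) and compares the observed success rate with the formula (α + β − 2αβ)/(2 − α − β).

```python
import random


def simulate_bandit(alpha, beta, patients, seed=0):
    """Observed success rate under the stay-on-success, switch-on-failure rule."""
    rng = random.Random(seed)
    success_prob = {"A": alpha, "B": beta}
    treatment = rng.choice(["A", "B"])  # first patient: chosen at random
    successes = 0
    for _ in range(patients):
        if rng.random() < success_prob[treatment]:
            successes += 1  # success: keep the same treatment
        else:
            treatment = "B" if treatment == "A" else "A"  # failure: switch
    return successes / patients


def p_two_armed(alpha, beta):
    """Closed form for the long-run success probability quoted in the exercise."""
    return (alpha + beta - 2 * alpha * beta) / (2 - alpha - beta)


alpha, beta = 0.7, 0.3  # illustrative values only
estimate = simulate_bandit(alpha, beta, 200_000)
print(round(p_two_armed(alpha, beta), 4))  # 0.58
print(round(estimate, 3))                  # should be close to 0.58
```

For comparison, the random strategy with these values succeeds with probability 0.5.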


Which strategy is better? 


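Exercise: Prove that $p_T - p_R \ge 0$ always, regardless of the values of $\alpha$ and $\beta$, where $p_T$ is the success probability under the two-armed bandit strategy and $p_R$ under the random strategy.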


This proves that the two-armed bandit strategy is always better than, or equal 
to, the random strategy. It shows that we have been able to construct a strategy 
that gives all patients an increased chance of success, even though we don’t know 
which treatment is better! 


[Figure: P(success) for different $\beta$ when $\alpha = 0.7$. Horizontal axis: $\beta$ from 0.0 to 1.0; vertical axis: P(success). Solid line: two-armed bandit strategy; dashed line: random strategy.]


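The graph shows the probability of success under the two different strategies,
for $\alpha = 0.7$ and for $0 < \beta < 1$. Notice how $p_T \ge p_R$ for all possible values of $\beta$, where $p_T$ is the two-armed bandit curve and $p_R$ the random-strategy curve; the two are equal only at $\beta = 0.7$.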
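
The values behind the graph can be tabulated from the two formulas in this section. The sketch below fixes α = 0.7 as in the figure and evaluates both success probabilities on a grid of β values; the grid spacing and the function names are arbitrary choices.

```python
def p_random(alpha, beta):
    """Success probability under the random strategy."""
    return 0.5 * (alpha + beta)


def p_two_armed(alpha, beta):
    """Long-run success probability under the two-armed bandit strategy."""
    return (alpha + beta - 2 * alpha * beta) / (2 - alpha - beta)


alpha = 0.7
betas = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

table = [(b, p_random(alpha, b), p_two_armed(alpha, b)) for b in betas]
for b, pr, pt in table:
    print(f"beta={b:.1f}  random={pr:.3f}  two-armed={pt:.3f}")

# The two-armed bandit value is never below the random value on this grid
# (a small tolerance allows for floating-point rounding at beta = 0.7).
never_worse = all(pt >= pr - 1e-12 for _, pr, pt in table)
print(never_worse)  # True
```

A grid check like this is only evidence, not a proof; the exercise above asks for the general argument.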


